Foreword¹


This book reports on preparatory work toward an important policy objective of the 
European Commission: turning Europe into a safe and privacy-respecting society 
that thrives by extracting maximum value from the data it produces and reuses, be it 
in support of important societal goals or as fuel for innovation in productive 
activities. 

Our plans for Europe are described in our July 2014 Communication on a data- 
driven economy, where we spell out a three-pronged approach addressing regula- 
tory issues (such as personal data protection and data ownership), framework 
conditions (such as data standards and infrastructures), and community building. 

The first visible step of our community building efforts is a massive commitment 
(534 million Euros by 2020), which we signed in October 2014, to enter into a Public
Private Partnership with the Big Data Value Association (BDVA): with help
from industrial parties and groups that represent relevant societal concerns (such as
privacy), we intend to identify and solve technical problems and framework 
conditions (such as skill development) that stand in the way of European companies 
increasing their productivity and innovativeness by making efficient use of data 
technologies. By shouldering some of the financial risk of these activities, we plan 
to leverage even more massive European investment: for every public Euro 
invested by the Commission, our industry partners have committed to investing 
four private Euros. 

Naturally, this requires some well-informed and clear thinking on which 
domains of data-related activities hold the greatest promise for a safe and prosper- 
ous Europe and on how we can avoid wasteful duplication in the development of 
data infrastructures, formats, and technologies. The book you are holding in your 
hands gives you a first lay of the land: it results from more than two years of work
(also funded by the European Union) aimed at identifying issues and opportunities
that are specifically European in character.

¹ The views expressed in the article are the sole responsibility of the author and in no way represent
the view of the European Commission and its services.

We fully expect that many of these results will be included and further elabo- 
rated over the years in the strategic planning of the BDVA, and we are happy to 
share them in this book with the broader public. 

We hope that you will find them informative and that they will help you shape 
your own thinking on what your expectations and active role might be in a better 
Europe that has taught itself to run on data. 


Luxembourg City, Luxembourg Giuseppe Abbamonte 
October 2015 Directorate G Media & Data 
European Commission DG CONNECT 


Foreword 


Data has become a factor just as important to production as labor, capital, and land. 
For the new value creators in today’s technology start-ups, little capital and office 
space is required. Both can be almost free when a firm is growing 1 % per day, on 
any metric. But without talent, and without the right kind of data, such a takeoff is 
highly improbable. 

We see the same forces at play in SAP’s Innovation Center Network. Attracting 
the right talent was critical to establish the first Innovation Center in Potsdam. And 
having large, real-world datasets from customers and co-innovation partners is 
critical to many of our innovations. To make a difference in cancer treatment and 
research with our Medical Research Insights app, we critically depended on data- 
driven collaboration with the National Center for Tumor Diseases. The same holds 
for incubating SAP’s new sports line of business by co-innovating with the German 
national soccer team based on real-time sensor feeds from their players. And it 
holds true for SAP’s many initiatives in the Internet of Things, like the predictive 
maintenance apps with John Deere and Kaeser. 

The Big Data Value Association (BDVA) is poised to make a difference both for 
data availability and for talent. By bringing together businesses with leading 
researchers, software and hardware partners, and enabling co-innovation around 
large, real-world datasets, BDVA can help lower the data barrier. And by helping
educate the next generation of thought leaders, especially in data science, computer
science, and related fields, BDVA can help increase the supply of talent. Both are
critical so Europe can begin to lead, not follow, in creating value from big data. 

By clearly defining the opportunity in big data, by examining the big data value 
chain, and by deep-diving into industry sector applications, this book charts a way 
forward to new value creation and new opportunities from big data. Decision makers, 
policy advisors, researchers, and practitioners on all levels can benefit from this. 


Jürgen Müller
Berlin, Germany Vice President, SAP Innovation Center Network 
Brussels, Belgium President, Big Data Value Association 


Preface 


Welcome to our humble contribution to the huge universe of big data literature. We 
could ironically say there are almost as many books, leaflets, conferences, and 
essays about the possibilities of big data as data itself to be collected, curated, 
stored, and analyzed, yet a single zettabyte of useful data is an amount of informa- 
tion we are currently incapable of writing, and as described in this book, 
16 zettabytes of data are waiting for us in 2020. 

However, according to many research and industrial organisations, this contri- 
bution is actually not that humble and is even unique in many senses. 

First of all, this book is not just another approach made by a single player looking 
down from a corner of the world. It is the compendium of more than 2 years of work 
performed by a set of major European research centers and industries. It is the 
compilation and processed synthesis of what we all have done, prepared, foreseen, 
and anticipated in many aspects of this challenging technological context that is 
becoming the major axis of the new digitally transformed business environment. 

But the most important part of the book is you, the reader. It is commonly said 
that “a map is useless for the one who does not know where to go.” This book is a 
map. An immediate goal of this book is to become a “User’s Manual” for those who 
want to blaze their own trail in the big data jungle. But it can also be used as a 
reference book for those experts who are sailing their own big data ship and want to 
clarify specific aspects on their journey. 

You, the reader, whether trailblazer or old sailor, have to make your own way through
the book. In this map, you will find answers and discussions not only about the legal
aspects of big data but also about its social impact and education needs and
requirements. You will also find business perspectives, discussions, and estimates of big
data activity in the different sectors of the economy, ranging from the public
sector to retail. And you will also find technological discussions about
the different stages of data and how to address these emerging technologies.

We worked on all these matters within the context of a European Commission 
project called BIG (Big Data Public Private Forum), which was an enormous
challenge and one that we believe has been successfully met.

The book is divided into four parts: Part I “The Big Data Opportunity” explores
the value potential of big data with a particular focus on the European context. 
Chapter 1 sets the scene for the value potential of big data and examines the legal,
business, and social dimensions that need to be addressed to deliver on its promise. 
Next, Chap. 2 briefly introduces the European Commission’s BIG project and its 
remit to establish a big data research roadmap for Horizon 2020 to support and 
foster research and innovation in the European Research Area. 

Part II “The Big Data Value Chain” details the complete big data lifecycle from 
a technical point of view, ranging from data acquisition, analysis, curation, and 
storage to data usage and exploitation. Chapter 3 introduces the core concepts of the 
big data value chain. The next five chapters detail each stage of the data value chain, 
including a state-of-the-art summary, emerging use cases, and key open research 
questions. Chapter 4 provides comprehensive coverage of big data acquisition, 
which is the process of gathering, filtering, and cleaning data before it is put in a 
data warehouse or any other storage solution for further processing. Chapter 5 
discusses big data analysis that focuses on transforming raw acquired data into a 
coherent, usable resource suitable for analysis to support decision-making and 
domain-specific usage scenarios. Chapter 6 investigates how the emerging big 
data landscape is defining new requirements for data curation infrastructures and 
how big data curation infrastructures are evolving to meet these challenges. 
Chapter 7 provides a concise overview of big data storage systems that are capable 
of dealing with high velocity, high volumes, and high varieties of data. Finally, 
Chap. 8 examines the business goals that need access to data and their analyses and 
integration into business decision-making in different sectors. 

Part III “Usage and Exploitation of Big Data” illustrates the value creation
possibilities of big data applications in various sectors, including industry, 
healthcare, finance, energy, media, and public services. Chapter 9 provides the 
conceptual background and overview of big data-driven innovation in society, 
highlighting factors and challenges associated with the adequate diffusion, uptake, 
and sustainability of big data-driven initiatives. The remaining chapters describe the 
state of the art of big data in different sectors, examining enabling factors, industrial 
needs, and application scenarios and distilling the analysis into a comprehensive set 
of requirements across the entire big data value chain. Chapter 10 details the wide 
variety of opportunities for big data technologies to improve overall healthcare 
delivery. Chapter 11 investigates the potential value to be gained from big data by 
government organizations by boosting productivity in an environment with signif- 
icant budgetary constraints. Chapter 12 explores the numerous advantages of big 
data for financial institutions. Chapter 13 examines the domain-specific big data 
technologies needed for cyber-physical energy and transport systems, where the 
focus needs to move beyond big data to smart data technologies. Chapter 14
discusses the media and entertainment sectors, which are in many respects
early adopters of big data technologies because these technologies enable them to drive digital
transformation, exploiting more fully not only data that was already available
but also new sources of data from both inside and outside the organization.

Finally, Part IV “A Roadmap for Big Data Research” identifies and prioritizes
the cross-sectorial requirements for big data research and outlines the most urgent 
and challenging technological, economic, political, and societal issues for big data 
in Europe. Chapter 15 details the process used to consolidate the big data
requirements from different sectors into a single prioritized set of cross-sector
requirements that were used to define the technology, business, policy, and society
roadmaps together with action recommendations. Chapter 16 describes the
roadmaps in the areas of technology, business, policy, and society. The chapter 
introduces the Big Data Value Association (BDVA) and the Big Data Value 
contractual Public Private Partnership (BDV cPPP) which provide a framework 
for industrial leadership, investment, and commitment of both the private and 
public sides to build a data-driven economy across Europe. 

We invite you to read this book at your convenience, and we hope that you will
enjoy it as much as we enjoyed preparing its contents.


Ciudad Real, Spain José María Cavanillas 
Galway, Ireland Edward Curry 
Saarbrücken, Germany Wolfgang Wahlster 


October 2015 
Part I
The Big Data Opportunity 


Chapter 1 
The Big Data Value Opportunity 


José María Cavanillas, Edward Curry, and Wolfgang Wahlster


1.1 Introduction 


The volume of data is growing exponentially, and it is expected that by 2020 there 
will be more than 16 zettabytes (16 trillion GB) of useful data (Turner et al. 2014).
We are on the verge of an era where every device is online, where sensors are 
ubiquitous in our world generating continuous streams of data, where the sheer 
volume of data offered and consumed on the Internet will increase by orders of 
magnitude, where the Internet of Things will produce a digital fingerprint of our 
world. 
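
The parenthetical conversion above (16 zettabytes as 16 trillion GB) can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) prefixes, which the text does not state explicitly; the helper name is purely illustrative.

```python
# Assumption: decimal (SI) prefixes, i.e. 1 GB = 10**9 bytes and 1 ZB = 10**21 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_ZB = 10**21


def zettabytes_to_gigabytes(zettabytes: int) -> int:
    """Convert a whole number of zettabytes to gigabytes (hypothetical helper)."""
    return zettabytes * BYTES_PER_ZB // BYTES_PER_GB


# The 2020 estimate quoted in the text: 16 zettabytes of useful data.
gigabytes_2020 = zettabytes_to_gigabytes(16)
print(f"{gigabytes_2020:,} GB")
```

Under this assumption, one zettabyte is 10^12 GB, so 16 zettabytes is 16 × 10^12 GB, that is, 16 trillion GB on the short scale.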

Big data is the emerging field where innovative technology offers new ways of 
extracting value from the tsunami of new information. The ability to effectively 
manage information and extract knowledge is now seen as a key competitive 
advantage. Many organizations are building their core business on their ability to 
collect and analyse information to extract business knowledge and insight. Big data 
technology adoption within industrial sectors is not a luxury but an imperative need 
for most organizations to survive and gain competitive advantage. 

This chapter explores the value potential of big data with a particular focus on 
the European context and identifies the positive transformational potential of big 
data within a number of key sectors. It discusses the need for a clear strategy to
increase the competitiveness of European industries and drive innovation.
Finally, the chapter describes the key dimensions, including
skills, legal, business, and social, that need to be addressed in a European Big Data 
Ecosystem. 


1.2 Harnessing Big Data 


The impacts of big data go beyond the commercial world; within the scientific 
community, the explosion of available data is producing what is called Data 
Science (Hey et al. 2009), a new data-intensive approach to scientific discovery. 
The capability of telescopes or particle accelerators to generate several petabytes of 
data per day is producing different problems in terms of storage and processing. 
Scientists do not have off-the-shelf solutions ready to analyse and properly compare 
huge, dispersed datasets. Enabling this vision will require innovative big data
technologies for data management, processing, analytics, discovery, and usage 
(Hey et al. 2009). 

Data has become a new factor of production, in the same way as hard assets and 
human capital. Having the right technological basis and organizational structure to 
exploit data is essential. Europe must exploit the potential of big data to create value 
for society, citizens, and businesses. However, from an industry adoption point of 
view, Europe is lagging behind the USA in big data technologies and is not taking 
advantage of the potential benefits of big data across its industrial sectors. A clear 
strategy is needed to increase the competitiveness of European industries through 
big data. While US-based companies are widely recognized for their work in big
data, very few European organizations are known for their work in the field. This
currently makes Europe dependent on technologies coming from outside and may 
prevent European stakeholders from taking full advantage of big data technology. 
Being competitive in big data technologies and solutions will give Europe a new 
source of competitiveness and the potential to foster a new data-related industry 
that will generate new jobs. 

Addressing the current problems requires a holistic approach, where technical 
activities work jointly with business, policy, and society aspects. Europe needs to 
define actions that support faster deployment and adoption of the technology in real 
cases. Support is needed not only to “build” the technology but also to “grow” the 
ecosystem that makes innovation possible. There are many technical challenges 
that will require further research, but this work has to be accompanied by a 
continuous understanding of how big data technologies support both business and 
societal challenges. How can data-driven innovation be integrated into an organi- 
zation’s processes, cultural values, and business strategy? Europe has a track record 
in joint research efforts, as well as strength in converging policies or eliminating 
adoption barriers. There is an opportunity to build upon these and other European 
strengths in order to enable a vision where big data contributes to making Europe
the most competitive economy in the world in 2020. 


1.3 A Vision for Big Data in 2020 


The Information and Communications Technology (ICT) sector is directly respon- 
sible for 5 % of European GDP, with a market value of 660 billion euros annually; it 
also contributes significantly to overall productivity growth (20 % directly from the 
ICT sector and 30 % from ICT investments). Big data solutions can contribute to
increasing European competitiveness by delivering value-adding tools, applications,
and services. One estimate for 2020 puts the potential of big and open data to 
improve the European GDP by 1.9 %, an equivalent of one full year of economic 
growth in the EU (Buchholtz et al. 2014). International Data Corporation (IDC) 
forecasts that the big data technology and services market will grow at a 27 % 
compound annual growth rate (CAGR) to $32.4 billion through 2017 (Vesset 
et al. 2013). 
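
To make the growth figure above concrete, the following sketch shows how a compound annual growth rate (CAGR) relates a starting market size to a final one. The 27 % rate and the $32.4 billion end value come from the text; the five-year horizon is an assumption made only for illustration, since the forecast's base year is not given here, and the function names are hypothetical.

```python
def compound(start_value: float, annual_rate: float, years: int) -> float:
    """Value after growing start_value at annual_rate for the given number of years."""
    return start_value * (1.0 + annual_rate) ** years


def implied_start(end_value: float, annual_rate: float, years: int) -> float:
    """Starting value implied by an end value, a CAGR, and a horizon."""
    return end_value / (1.0 + annual_rate) ** years


def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values over the given horizon."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


END_VALUE_BN = 32.4   # USD billions, from the text
RATE = 0.27           # 27 % CAGR, from the text
ASSUMED_YEARS = 5     # assumption: the text does not state the forecast horizon

base = implied_start(END_VALUE_BN, RATE, ASSUMED_YEARS)
print(f"Implied starting market size: ${base:.1f} billion")
print(f"Recovered CAGR: {cagr(base, END_VALUE_BN, ASSUMED_YEARS):.2%}")
```

A different assumed horizon changes the implied starting value, so the output should be read as an illustration of the formula rather than as a figure from the cited forecast.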

In March 2010, the European Commission launched the Europe 2020 Strategy
(European Commission 2010) to exit the crisis and prepare the EU economy for the
next challenges in terms of productivity, economy, and social cohesion. The Digital 
Agenda for Europe is one of the seven flagship initiatives of the Europe 2020 
Strategy; it defines the key enabling role that the use of ICT will have to play if 
Europe wants to succeed in its ambitions for 2020. The paramount importance of 
big data was recognized by including a specific topic in the Digital Agenda to get 
maximum benefit from existing data and specifically the need to open up public 
data resources for re-use. As then EU Commissioner Kroes stated, “Big Data is the 
new Oil” that can be managed, manipulated, and used like never before thanks to 
high-performance digital tools, making big data the fuel for innovation. 


1.3.1 Transformation of Industry Sectors 


The potential for big data is expected to impact all sectors, from healthcare to 
media, from energy to retail (Manyika et al. 2011). The positive transformational 
potential has already been identified in a number of key sectors. 


• Healthcare: In the early twenty-first century, Europe is an ageing society that
places significant demands on its healthcare infrastructure. There is an urgent 
need for improvement in efficiency of the current healthcare system to make it 
more sustainable. The application of big data has significant potential in the 
sector with estimated savings in expenditure at 90 billion euros from national 
healthcare budgets in the EU (Manyika et al. 2011). Clinical applications of big 
data range from comparative effectiveness research where the clinical and 
financial effectiveness of interventions is compared to the next generation of 
clinical decision support systems that make use of comprehensive heterogeneous
health datasets as well as advanced analytics of clinical operations. Healthcare 
R&D applications include predictive modelling, statistical tools, and algorithms 
to improve clinical trial design, personalized medicine, and analysing disease 
patterns. 

• Public Sector: Europe’s public sector accounts for almost half of GDP and can
benefit significantly from big data to gain efficiency in administrative processes. 
Big data could reduce the costs of administrative activities by 15-20 %, creating 
the equivalent of 150 billion euros to 300 billion euros in new value (OECD 
2013). Potential benefits in the public sector include improved transparency via 
open government and open data, improved public procurement, enhanced allo- 
cation of funding into programmes, higher quality services, increased public 
sector accountability, and a better-informed citizen. Crucial to the future is the 
definition of policies to share data across government agencies and to inform 
citizens about the trade-offs between the privacy and security risks of sharing 
data and the benefits they can gain. Big data will also change the relationship 
between citizens and government by empowering citizens to understand political 
and social issues in new transparent ways, enabling them to engage with local, 
regional, national, and global issues through participation. 

• Finance and Insurance: There are a number of ways for financial service
companies to achieve business advantages by mining and analysing data. 
These include enhanced retail customer service, detection of fraud, and improve- 
ment of operational efficiencies. Big data can be used to identify exposure in real 
time across a range of sophisticated financial instruments like derivatives. 
Predictive analysis of both internal and external data results in better, proactive 
management of a wide range of issues from credit and operational risk (e.g. fraud 
and reputational risk) to customer loyalty and profitability. A challenge for the 
financial sector is how to use the breadth and depth of data available to satisfy 
more demanding regulators while also providing personalized services for their 
customers. 

• Telecom, Media, and Entertainment: Big data analysis and visualization
techniques can enable the effective discovery and delivery of media content 
enabling users to dynamically interact with new media and content across 
multiple platforms. The domain of personal location data offers the potential 
for new value creation with applications, including location-based content 
delivery for individuals, smart personalized content routing, automotive 
telematics, mobile location-based services, and geo-targeted advertising.

• Retail: Significant opportunities for using big data technologies reside in the
interactions between retailers and consumers. Data is playing an increasing role 
as consumers search, research, compare, buy, and obtain support online and the 
products sold by retailers increasingly generate their own data footprints. Big 
data can increase productivity and efficiency resulting in a potential 60 % 
increase in retailers’ operating margins (Manyika et al. 2011). Big data can 
impact retail in areas such as marketing: cross-selling, location-based marketing, 
in-store behaviour analysis, customer micro-segmentation, customer sentiment 
analysis, enhancement of multi-channel consumer experience; merchandizing:
assortment optimization, pricing optimization, placement and design optimiza- 
tion; operations: performance transparency, labour inputs optimization; supply 
chain: inventory management, distribution and logistics optimization, informing 
supplier negotiations; new business models: price comparison services, 
web-based markets. 

• Manufacturing: The manufacturing sector was an early adopter of IT to design,
build, and distribute products. The next generation of smart factories with
intelligent and networked machinery (i.e. Internet of Things, Industry 4.0) will 
see further efficiency improvement in design, production, and product quality. 
Big data will enable fulfilment of customer needs through precisely targeted 
products and effective distribution. In addition to efficiency gains and predictive 
maintenance, big data will enable entirely new business models in the area of 
mass production of individualized products. 

• Energy and Transport: Big data will open up new opportunities for innovative
ways to monitor and control transportation and logistics networks using a variety 
of data sources and the Internet of Things. The potential for big data in the 
transport sector is estimated at USD 500 billion worldwide in the form of time 
and fuel savings, with the avoidance of 380 megatonnes of CO₂ emissions
(OECD 2013). The digitization of energy systems enables the acquisition of 
real-time, high-resolution data via smart meters that can be leveraged within
advanced analytics to improve the levels of efficiency within both the demand 
and supply sides of energy networks. Smart buildings and smart cities will be 
key drivers of enhanced efficiency in the energy sectors. Big data technology in 
the utilities sectors has the potential to reduce CO₂ emissions by more than
2 gigatonnes, equivalent to 79 billion euros (OECD 2013). 


A successful data ecosystem would “bring together data owners, data analytics 
companies, skilled data professionals, cloud service providers, companies from the 
user industries, venture capitalists, entrepreneurs, research institutes and universi- 
ties” (DG Connect 2013). A successful data ecosystem, which is a prominent 
feature of the data-driven economy, would see these stakeholders interact seam- 
lessly within a Digital Single Market, leading to business opportunities, easier 
access to knowledge, and capital (European Commission 2014). “The Commission 
can contribute to this by bringing the relevant players together and by steering the 
available financial resources that facilitate collaboration among the various stake- 
holders in the European data economy” (DG Connect 2013). 

Big data offers tremendous untapped potential value for many sectors; however, 
there is no coherent data ecosystem in Europe. As Commissioner Kroes explained, 
“The fragmentation concerns sectors, languages, as well as differences in laws and 
policy practices between EU countries” (European Commission 2013; Kroes 2013). 
During the ICT 2013 Conference, Commissioner Kroes called for a European 
public-private partnership on big data to create a coherent European data ecosystem
that stimulates research and innovation around data, as well as the uptake of
cross-sector, cross-lingual, and cross-border data services and products. She also
noted the need for ensuring privacy: “Mastering big data means mastering privacy
too” (Kroes 2013). In order for this to occur, an interdisciplinary approach is 
required to create an optimal business environment for big data that will accelerate 
adoption within Europe. 


1.4 A Big Data Innovation Ecosystem 


In order to drive innovation and competitiveness, Europe needs to foster the 
development and wide adoption of big data technologies, value-adding use cases,
and sustainable business models. While no coherent data ecosystem exists at the 
European level (DG Connect 2013), the benefits of sharing and linking data across 
domains and industry are becoming obvious. An ecosystem approach allows 
organizations to create new value that no single organization could achieve by 
itself (Adner 2006). A European Big Data Ecosystem is an important factor for 
commercialization and commoditization of big data services, products, and plat- 
forms. Within a healthy business ecosystem, companies can work together in a 
complex business web where they can easily exchange and share vital resources 
(Kim et al. 2010). If a Big Data Ecosystem is to emerge in Europe, it is important 
that the different actors within the ecosystem “define a shared vision and jointly 
identify gaps in the current data landscape” (DG Connect 2013). A successful big 
data ecosystem would see all “stakeholders interact seamlessly within a Digital 
Single Market, leading to business opportunities, easier access to knowledge, and 
capital” (European Commission 2014). 


1.4.1 The Dimensions of a European Big Data Ecosystem


An efficient use and understanding of big data as an economic asset carries great 
potential for the EU economy and society. The challenges for establishing a Big 
Data Ecosystem in Europe have been grouped into a set of key dimensions
(Cavanillas et al. 2014) as illustrated in Fig. 1.1. Europe must address these multiple 
challenges (Cavanillas et al. 2014) to foster the development of a big data 
ecosystem. 


Fig. 1.1 The dimensions of a Big Data Value Ecosystem [adapted from Cavanillas et al. (2014)]

• Data: Availability and access to data will be the foundation of any data-centric
ecosystem. A healthy data ecosystem will consist of a wide spectrum of different
data types: structured, unstructured, multi-lingual, machine and sensor generated,
static, and real-time data. The data in the ecosystem should come from
different sectors, including healthcare, energy, retail, and from both public and
private sources. Value may be generated in many ways, by acquiring data,
combining data from different sources and across sectors, providing low-latency
access, improving data quality, ensuring data integrity, enriching data,
extracting insights, and preserving privacy.

• Skills: A critical challenge for Europe will be ensuring the availability of skilled
workers in the data ecosystem. An active ecosystem will require data scientists 
and engineers who have expertise in analytics, statistics, machine learning, data 
mining, and data management. Technical experts will need to be combined with 
data-savvy business experts with strong domain knowledge and the ability to
apply their data know-how within organizations for value creation. 

• Legal: Appropriate regulatory environments are needed to facilitate the
development of a pan-European big data marketplace. Legal clarity is needed on
issues such as data ownership, usage, protection, privacy, security, liability, 
cybercrime, intellectual property rights, and the implications of insolvencies 
and bankruptcy. 

• Technical: Key technical challenges need to be overcome including large-scale
and heterogeneous data acquisition, efficient data storage, massive real-time 
data processing and data analysis, data curation, advanced data retrieval and 
visualization, intuitive user interfaces, interoperability and linking data, infor- 
mation, and content. All of these topics need to be advanced to sustain or 
develop competitive advantages. 

• Application: Big data has the potential to transform many sectors and domains
including the health, public sector, finance, energy, and transport sectors.
Innovative value-driven applications and solutions must be developed,
validated, and delivered in the big data ecosystems if Europe is to become the
world leader.

• Business: A big data ecosystem can support the transformation of existing
business sectors and the development of new start-ups with innovative business 
models to stimulate growth in employment and economic activity. 

• Social: It is critical to increase awareness of the benefits that big data can deliver
for business, the public sector, and the citizen. Big data will provide solutions for 
major societal challenges in Europe, such as improved efficiency in healthcare, 
increased liveability of cities, enhanced transparency in government, and 
improved sustainability. 


1.5 Summary 


Big data is one of the key economic assets of the future. Mastering the potential of 
big data technologies and understanding their potential to transform industrial 
sectors will enhance the competitiveness of European companies and result in 
economic growth and jobs. Europe needs a clear strategy to increase the compet- 
itiveness of European industries in order to drive innovation. Europe needs to foster 
the development and wide adoption of big data technologies, value-adding use
cases, and sustainable business models through a Big Data Ecosystem. Strategic
investments are needed by both the public and private sector to enable Europe to be 
the leader in the global data-driven digital economy and to reap the benefits it offers 
with the creation of a European Big Data Ecosystem.


Chapter 2
The BIG Project 


Edward Curry, Tilman Becker, Ricard Munné, Nuria De Lama, 
and Sonja Zillner 


2.1 Introduction 


The Big Data Public Private Forum (BIG) Project (http://www.big-project.eu/) was 
an EU coordination and support action to provide a roadmap for big data within 
Europe. The BIG project worked towards the definition and implementation of a
clear big data strategy that addressed the activities needed in research and
innovation, technology adoption, and the support required from the European
Commission for the successful implementation of the big data economy.
As part of this strategy, the outcomes of the project were used as input for 
Horizon 2020.

Foundational research technologies and innovative sectorial applications were
analysed and assessed in the BIG project in order to create technology and strategy 
roadmaps so that business and operational communities understand the potential of 
big data technologies and are enabled to implement appropriate strategies and 
technologies for commercial benefit. 

This chapter provides an overview of the BIG project detailing the project’s 
mission and strategic objectives. The chapter describes the partners within the 
consortium and the overall structure of the project work. The three-phase method- 
ology used in the project is described, including details on the techniques used 
within the technical working groups, sectorial forums, and roadmapping activity.
Finally, the project’s role in setting up the Horizon 2020 Big Data Value contractual 
Public Private Partnership and Big Data Value Association is discussed. 


2.2 Project Mission 


In order to realize the vision of a data-driven society in 2020, Europe has to prepare 
the right ecosystem around big data. Public and private organizations need to have 
the necessary infrastructures and technologies to deal with the complexity of big 
data, but should also be able to use data to maximize their competitiveness and 
deliver business value. 

Building an industrial community around big data in Europe was a key priority 
of the BIG project, together with setting up the necessary collaboration and 
dissemination infrastructure to link technology suppliers, integrators, and leading 
user organizations. The BIG project (from now on referred to as BIG) worked 
towards the definition and implementation of a strategy that includes research and 
innovation, but also technology adoption. The establishment of the community 
together with adequate resources to work at all levels (technical, business, political, 
etc.) is the basis for a long-term European strategy. Convinced that a strong reaction 
is needed, BIG defined its mission accordingly: 


The mission of BIG is setting up an ecosystem that will bring together all the 
relevant stakeholders needed to materialize a data-driven society in 2020. 
This ecosystem will ensure that Europe plays a leading role in the definition 
of the new context by building the necessary infrastructures and technologies, 
generating a suitable innovation space where all organizations benefit from 
data, and provides a pan-European framework to coherently address policy, 
regulatory, legal, and security barriers. 


The BIG mission was broken down into a number of specific strategic objectives 
for the project.


2.3 Strategic Objectives


In September 2012, the project identified a set of strategic objectives to ensure it 
delivered on its mission. The specific objectives were: 


BIG will set up an industry-led initiative around Intelligent Information
Management (IIM) and Big Data to contribute to EU competitiveness and position
it in Horizon 2020: Industrial leadership will guide actions towards real busi- 
ness benefits, but will be complemented by the views of academia and research 
organizations, which will also take part in this endeavour. The project will take a 
long-term approach to represent the views and interests of IIM stakeholders, 
with a special focus on big data due to its relevance in the current and future 
context. Decisions such as establishing it as a legal entity will be considered, and 
potential mergers with relevant associations at the EU level will also be envis- 
aged for the sake of sustainability and impact. 

BIG will elaborate an integrated roadmap that takes into consideration 
technical, business, policy, and society aspects, focusing not only on pure 
technical issues, but also establishing priorities based on expected impact. The 
BIG consortium will engage the necessary expertise to ensure contributions not 
only from project partners, but also from a wider community composed of
experts in relevant technical domains as well as experts in sectors or application 
domains where the use of these technologies is expected to produce a high 
impact. 

BIG will ensure that technical research areas selected by the project cover 
the needs expressed by the industry in different application domains: For 
this to happen, a sharp understanding is needed of how big data can be applied 
within industrial sectors. This understanding needs to be transmitted to domain 
experts to establish a clear path for the adoption of the technology in each of the 
selected sectors. 

BIG will promote adoption of earlier waves of big data technology: Instead 
of adopting only a futuristic approach, BIG will use as a starting point those 
technologies that are already in place. The objective is to reach a clear under- 
standing of the level of maturity of different technical solutions as well as the 
feasibility of their implementation. This will be valuable information with 
respect to the state of the art and will be used as input for the elaboration of 
both the sectorial and the integrated roadmaps. 

BIG will define and promote actions dealing with policy and regulation, 
including aspects such as data security, intellectual property, privacy, liability, 
and data access. BIG will contribute to the entire ecosystem related to big data 
implementation without restricting its activities to only technical issues. 

BIG will carry out dissemination actions targeting different stakeholders 
and players in the value chain: Dissemination actions will be customized to the 
different communities (e.g. technical experts, data scientists, technical
managers, business managers, and executives in both Multinational Corporations
(MNCs) and Small and Medium-sized Enterprises (SMEs)). BIG addressed all
the relevant communities with an ambitious strategy including presence in mass
media, relevant conferences, organization of workshops and events, and
maximization of the use of web channels.


All this will not be possible without providing the right collaboration
infrastructures. Collaboration among projects, but also many discussions
between all the relevant stakeholders and actors in the value chain, including
major industrial organizations in the EU landscape, will take place. Bearing this
in mind, BIG set up and maintained a support infrastructure to enable
collaboration, information sharing, and customization of actions toward
different targeted audiences.


2.4 Consortium 


The participants of the BIG consortium (illustrated in Fig. 2.1) were carefully 
selected to include key players with complementary skills in both industry and 
academia. Each of the project partners had experience in cutting-edge European 
projects and significant connections to key stakeholders in the big data marketplace. 
The academic partners, using their expert knowledge in the field, led the technical
investigations of big data technology. The industrial partners were well positioned
in their knowledge of large-scale data management products and services and their
application within different industrial sectors.

The partners of the BIG consortium were:

• Industry: Atos, Press Association (PA), Siemens, AGT International, Exalead,
and the Open Knowledge Foundation (OKF) 

• Academia: University of Innsbruck (UIBK), National University of Ireland
Galway (NUIG), University of Leipzig, German Research Centre for Artificial 
Intelligence (DFKI), and STI International 


2.5 Stakeholder Engagement 


Essential for the success of a large-scale, cross-fertilization, and broad road map- 
ping effort is the involvement of a large fraction of the community and industry, not 
only from the point of view of technology provision but also technology adoption. 
The project took an inclusive approach to stakeholder engagement and actively 
solicited inputs from the wider community composed of experts in technical 
domains as well as experts in business sectors. An open philosophy was applied 
to all the documents generated by the project, which were made public to the wider 
community for active contribution and content validation. The project held stake- 
holder workshops to engage the community within the project. The first workshop 
was held at the European Data Forum (EDF) 2013 in Dublin to announce the project 
to the community and gather participants. The second workshop took place at EDF 
2014 in Athens to present the interim results of the project for feedback and further 
validation with stakeholders. Over the duration of the project a number of well- 
attended sector-specific workshops were held to gather needs and validate findings. 
At the end of the project, a final workshop was convened in Heidelberg in October
2014 to present the results of the project.


2.6 Project Structure 


The work of the BIG project was split into groups focusing on industrial sectors and 
technical areas. The project structure comprised sectorial forums and technical
working groups. 


Sectorial forums examined how big data technologies can enable business inno- 
vation and transformation within different sectors. The sector forums were led by 
the industrial partners of the project. Their objective was to gather big data 
requirements from vertical industrial sectors, including health, public sector, 
finance, insurance, telecoms, media, entertainment, manufacturing, retail, energy, 
and transport (see Fig. 2.2). 


Technical working groups focused on big data technologies for each activity in 
the data value chain to examine their capabilities, level of maturity, clarity, under- 
standability, and suitability for implementation. The technical groups (see Fig. 2.3) 
were led by the academic partners in BIG and examined emerging technological 
and research trends for coping with big data.

Fig. 2.2 Sectorial forums within the BIG project

Fig. 2.3 Technical working groups within the BIG project

Fig. 2.4 The BIG project structure

As illustrated in Fig. 2.4, the needs identified by sector forums were used to
understand the maturity and gaps in the capability offered by current big data
technology. This analysis provided a clear picture of the limitations and
expectations regarding big data technology deployment. The outputs of the analysis were
used to produce a series of consensus-reflecting roadmaps that defined priorities and
actions needed for big data in each sector.


2.7 Methodology 


From an operational point of view, BIG defined a set of activities based on a three- 
phase approach as illustrated in Fig. 2.5. The three phases were: 


1. Technology state of the art and sector analysis 
2. Roadmapping activity 
3. Big data public private partnership 


2.7.1 Technology State of the Art and Sector Analysis 


In the first phase of the project, the sectorial forums and the technical working 
groups performed a parallel investigation in order to identify:

• Sectorial needs and requirements gathered from different stakeholders
• The state of the art of big data technologies as well as research
challenges

Fig. 2.5 Three-phase methodology of BIG


As part of the investigation, application sectors expressed their needs with 
respect to the technology as well as possible limitations and expectations regarding 
its current and future deployment. 

Using the results of the investigation, a gap analysis was performed that compared
the technology capability that was ready with the sectorial expectations of the
technological capability required both currently and in the future. The
analysis produced a series of consensus-reflecting sectorial roadmaps that defined 
priorities and actions to guide further steps in big data research. 
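
The gap analysis described above can be pictured as a comparison between what sectors require and what technology already offers. The following minimal Python sketch assumes that capabilities can be represented as plain labels and that a gap is simply a required capability that is not yet ready; the function name, the capability labels, and the set-difference notion of a gap are all hypothetical and are not taken from the project.

```python
# Hypothetical illustration of a gap analysis between sector needs and
# technology readiness. All names and data below are invented for the example.

def gap_analysis(required_now, required_future, ready):
    """Return required capabilities that are not yet covered by ready technology."""
    ready = set(ready)
    needed = set(required_now) | set(required_future)
    return {
        # Needed today but not available
        "current_gaps": sorted(set(required_now) - ready),
        # Expected to be needed later but not available
        "future_gaps": sorted(set(required_future) - ready),
        # Needed (now or later) and already available
        "covered": sorted(needed & ready),
    }


example = gap_analysis(
    required_now={"data integration", "real-time analytics"},
    required_future={"predictive modelling", "data integration"},
    ready={"data integration"},
)
print(example)
```

In such a sketch, the entries under `current_gaps` and `future_gaps` would be the candidates for priorities in a sectorial roadmap.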


2.7.1.1 Technical Working Groups 


The goal of the technical working groups was to investigate the state of the art in big 
data technologies to determine their level of maturity, clarity, understandability, and
suitability for implementation. To allow for an extensive investigation and detailed 
mapping of developments, the technical working groups deployed a combination of 
a top-down and bottom-up approach, with a focus on the latter. The working
groups followed a four-step approach: (1) literature research, (2) subject
matter expert interviews, (3) stakeholder workshops, and (4) technical survey. 

In the first step each technical working group performed a systematic literature 
review based on the following activities: 


• Identification of relevant types and sources of information

• Analysis of key information in each source

• Identification of key topics for each technical working group

• Identification of the key subject matter experts for each topic as potential
interview candidates

• Synthesizing the key message of each data source into state-of-the-art
descriptions for each identified topic


The experts within the consortium outlined the initial starting points for each 
technical area, and the topics were expanded through the literature search and from 
the subject matter expert interviews. 

The following types of data sources were used: scientific papers published in 
workshops, symposia, conferences, journals and magazines, company white papers, 
technology vendor websites, open source projects, online magazines, analysts’ data, 
web blogs, other online sources, and interviews conducted by the BIG consortium. 
The groups focused on sources that mention concrete technologies and analysed 
them with respect to their values and benefits. 

The synthesis step compared the key messages and extracted agreed views that 
were then summarized in the technical white papers. Topics were prioritized based
on the degree to which they are able to address business needs as identified by the
sectorial forum working groups. 

The literature survey was complemented with a series of interviews with subject 
matter experts for relevant topic areas. Subject matter expert interviews are a
technique well suited to data collection, particularly for exploratory research,
because they allow expansive discussions that illuminate factors of importance
(Oppenheim 1992; Yin 2009). The information gathered is likely to be more 
accurate than information collected by other methods since the interviewer can 
avoid inaccurate or incomplete answers by explaining the questions to the inter- 
viewee (Oppenheim 1992). 

The interviews followed a semi-structured protocol. The topics of the interview 
covered different aspects of big data, with a focus on: 


• Goals of big data technology

• Beneficiaries of big data technology

• Drivers and barriers for big data technologies

• Technology and standards for big data technologies


An initial set of interviewees was identified from the literature survey, contacts 
within the consortium, and a wider search of the big data ecosystem. Interviewees 
were selected to be representative of the different stakeholders within the big data 
ecosystem. The selection of interviewees covered (1) established providers of big 
data technology (typically MNCs), (2) innovative sectorial players who are suc- 
cessful at leveraging big data, (3) new and emerging SMEs in the big data space, 
and (4) world leading academic authorities in technical areas related to the Big Data 
Value Chain. 


2.7.1.2 Sectorial Forums 


The overall objective of the sectorial forums was to acquire a deep understanding of 
how big data technology can be used in the various industrial sectors, such as 
healthcare, public, finance and insurance, and media. 

In order to identify the user needs and industrial requisites of each domain, the 
sectorial forums followed a research methodology encompassing the following 
three steps as illustrated in Fig. 2.6. For each industrial sector, the steps were 
accomplished separately. However, in the case where sectors were related (such 
as energy and transport) the results have been merged for those sectors in order to 
highlight differences and similarities. 

The aim of the first step was to identify both stakeholders and use cases for big
data applications within the different sectors. Therefore, a survey was conducted 
including scientific reviews, market studies, and other Internet sources. This knowl- 
edge allowed the sectorial forums to identify and select potential interview partners 
and guided the development of the questionnaire for the domain expert interviews. 

The questionnaire consisted of up to 12 questions that were clustered into three 
parts:


• Direct inquiry of specific user needs

• Indirect evaluation of user needs by discussing the relevance of the use cases
identified at Step 1 as well as any other big data applications of which they were
aware

• Reviewing constraints that need to be addressed in order to foster the
implementation of big data applications in each sector


Fig. 2.6 The three steps of the sectorial forums research methodology


In the second step, semi-structured interviews were conducted using the devel- 
oped questionnaire. At least one representative of each stakeholder group identified 
in Step 1 was interviewed. To derive the user needs from the collected material, the 
most relevant and frequently mentioned use cases were aggregated into high-level 
application scenarios. The data collection and analysis strategy was inspired by the 
triangulation approach (Flick 2004). Reviewing and quantitatively assessing the
high-level application scenarios yielded a reliable analysis of user needs.
Examinations of the likely constraints of big data applications helped to identify the
relevant requirements that needed to be addressed. 

The third step involved a crosscheck and validation of the initial results of the 
first two steps by involving stakeholders of the domain. Some sectors conducted 
dedicated workshops and webinars with industrial stakeholders to discuss and 
review the outcomes. The results of the workshops were studied and integrated 
whenever appropriate. 


2.7.2 Cross-Sectorial Roadmapping 


Comparison among the different sectors enabled the identification of commonali- 
ties and differences at multiple levels, including technical, policy, business, and 
regulatory. The analysis was used to define an integrated cross-sectorial roadmap 
that provides a coherent holistic view of the big data domain. The cross-sectorial 
big data roadmap was defined using the following three steps: 


1. Consolidation to establish a common understanding of requirements as well as 
technology descriptions and terms used across domains 

2. Mapping to identify any technologies needed to address the identified
cross-sector requirements

3. Temporal alignment to highlight which technologies need to be available at
what point in time by incorporating the estimated adoption rate by the involved 
stakeholders 


The remainder of this section describes each of these steps in more detail. 


2.7.2.1 Consolidation 


Alignment among the technical working groups, and between the technical working 
groups and the sectorial forums, was important and facilitated through early 
exchange of drafts, one-on-one meetings, and the collection of consolidated 
requirements through the sectorial forums. In order to align the sector-specific labelling of
requirements, a consolidated description was established. In doing so, each sector 
provided their requirements with the associated user needs. In dedicated meetings, 
similar and related requirements were clustered and then merged, aligned, or 
restructured. Thus, the initial list of 13 high-level requirements and 28 sub-level 
requirements could be reduced to 8 high-level requirements and 25 sub-level 
requirements. In summary, the consolidation phase reduced the total number of
requirements from 41 to 33, a reduction of roughly 20 %.
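
The arithmetic behind the reported reduction can be checked directly from the counts given in this section. The short Python sketch below uses only those four figures; the variable and function names are illustrative assumptions.

```python
# Requirement counts as reported in the consolidation step.
before = {"high_level": 13, "sub_level": 28}
after = {"high_level": 8, "sub_level": 25}


def reduction_percent(before, after):
    """Percentage by which the total number of requirements was reduced."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    return 100 * (total_before - total_after) / total_before


total_before = sum(before.values())  # 13 + 28 = 41
total_after = sum(after.values())    # 8 + 25 = 33
pct = reduction_percent(before, after)
print(f"{total_before} -> {total_after}: {pct:.1f} % fewer requirements")
# prints "41 -> 33: 19.5 % fewer requirements"
```

The exact value is 8/41, or about 19.5 %, which is consistent with the rounded figure of 20 % quoted in the text.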


2.7.2.2 Mapping 


For mapping technology to requirements the technical working groups indicated 
which technology could be used to address the consolidated requirements. Besides 
providing a mapping between requirements and technologies, the technical working 
groups also indicated the associated research challenges. 

Within a 1-day workshop, the initial mapping of technologies and requirements 
was consolidated in two steps. First, the indicated technological capabilities were 
analysed in further detail by describing how the sector-specific aspects of each 
cross-sector requirement can be handled. Second, for each cross-sector requirement 
it was investigated whether the technologies from various technical working groups 
need to be combined in order to address the full scope of the requirement. At the end 
of the discussion, any technologies that were requested by at least two sectors were 
included in the cross-sector roadmap.
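
The selection rule at the end of the workshop (keep any technology requested by at least two sectors) can be sketched as a simple count over sector requests. The Python example below is a minimal illustration under stated assumptions: the sector names, technology labels, and the function `cross_sector_technologies` are hypothetical and do not reproduce the project's actual data or tooling.

```python
from collections import defaultdict


def cross_sector_technologies(requests, min_sectors=2):
    """Return technologies requested by at least `min_sectors` distinct sectors.

    `requests` maps a sector name to the technologies that sector asked for.
    """
    sectors_by_tech = defaultdict(set)
    for sector, technologies in requests.items():
        for tech in technologies:
            sectors_by_tech[tech].add(sector)
    return sorted(
        tech for tech, sectors in sectors_by_tech.items()
        if len(sectors) >= min_sectors
    )


# Invented example data, for illustration only.
requests = {
    "health": ["semantic annotation", "data integration"],
    "energy": ["stream processing", "data integration"],
    "transport": ["stream processing"],
}
selected = cross_sector_technologies(requests)
print(selected)  # ['data integration', 'stream processing']
```

Using a set per technology means that a sector listing the same technology twice is still counted once, which matches the "at least two sectors" wording.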


2.7.2.3 Temporal Alignment


After identifying the key technologies, their temporal alignment needed to be
defined. This was achieved by answering two questions: 


• How long is the development time of the technology?
• When will the stakeholders involved adopt the technology?


The development time for each technology indicates how much time is needed to
solve the associated research challenges. This time frame depends on the technical 
complexity of the challenge together with the extent to which sector-specific 
extensions are needed. In order to determine the adoption rate of big data
technology (or the associated use case), non-technical requirements such as availability of
business cases, suitable incentive structures, legal frameworks, potential benefits, as 
well as the total cost for all the stakeholders involved (Adner 2012) were 
considered. 


2.8 Big Data Public Private Partnership 


The Big Data Public Private Forum, as it was initially called, was intended to create 
the path towards implementation of the roadmaps. The path required two major 
elements: (1) a mechanism to include content of the roadmaps into real agendas 
supported by the necessary resources (economic investment of both public and 
private stakeholders) and (2) a community interested in the topics and committed to 
making the investment and collaborating towards the implementation of the 
agendas. 

The BIG consortium was convinced that achieving this result would require 
creating a broad awareness and commitment outside the project. BIG took the 
necessary steps to contact major players and to liaise with the NESSI European 
Technology Platform to jointly work towards this endeavour. The collaboration was 
set up in the summer of 2013 and allowed the BIG partners to establish the 
necessary high-level connections at both industrial and political levels. This col- 
laboration led to the following outcomes: 


• The Strategic Research & Innovation Agenda (SRIA) on Big Data Value that
was initially fed by the BIG technical papers and roadmaps and has been
extended with the input of a public consultation that included hundreds of
additional stakeholders representing both the supply and the demand side.

• A cPPP (contractual PPP) proposal as the formal step to set up a PPP on Big
Data Value. The cPPP proposal builds on the SRIA by adding further content
elements such as potential instruments that could be used for the implementation
of the agenda.

• The formation of a representative community of stakeholders that has
endorsed the SRIA and expressed an interest and commitment in getting
involved in the cPPP. The identification of an industrially led core group ready
to commit to the objectives of the cPPP with a willingness to invest money
and time.

• The establishment of a legal entity based in Belgium: a non-profit organization
named Big Data Value Association (BDVA) to represent the private side of the
cPPP. The BDVA had 24 founding members, including many partners of the
BIG project.

• And finally, the signature of the Big Data Value cPPP between the BDVA and
the European Commission. The cPPP was signed by Vice President Neelie 
Kroes, the then EU Commissioner for the Digital Agenda, and Jan Sundelin, 
the president of the Big Data Value Association (BDVA), on 13 October 2014 in 
Brussels. The BDV cPPP provides a framework that guarantees the industrial 
leadership, investment, and commitment of both the private and the public side 
to build a data-driven economy across Europe, mastering the generation of value 
from big data and creating a significant competitive advantage for European 
industry that will boost economic growth and jobs. 


2.9 Summary 


The Big Data Public Private Forum (BIG) Project was an EU coordination and 
support action to provide a roadmap for big data within Europe. The BIG project 
worked towards the definition and implementation of a clear big data strategy that
addressed the activities needed in research and innovation, technology
adoption, and the support required from the European Commission for
the successful implementation of the big data economy.

The BIG project used a three-phase methodology with technical working groups 
examining foundational technologies, sectorial forums examining innovative
sectorial applications, and a roadmapping activity to create technology and strategy
roadmaps so that business and operational communities understand the potential of 
big data technologies and are enabled to implement appropriate strategies and 
technologies for commercial benefit. The project was a key contributor to setting
up the Horizon 2020 Big Data Value contractual Public Private Partnership
(cPPP) and the Big Data Value Association.


Chapter 3
The Big Data Value Chain: Definitions, 
Concepts, and Theoretical Approaches 


Edward Curry 


3.1 Introduction 


The emergence of a new wave of data from sources, such as the Internet of Things, 
Sensor Networks, Open Data on the Web, data from mobile applications, social 
network data, together with the natural growth of datasets inside organisations 
(Manyika et al. 2011), creates a demand for new data management strategies 
which can cope with these new scales of data environments. Big data is an emerging 
field where innovative technology offers new ways to reuse and extract value from 
information. The ability to effectively manage information and extract knowledge 
is now seen as a key competitive advantage, and many organisations are building 
their core business on their ability to collect and analyse information to extract 
business knowledge and insight. Big data technology adoption within industrial 
sectors is not a luxury but an imperative need for most organisations to gain 
competitive advantage. 

This chapter examines definitions and concepts related to big data. The chapter 
starts by exploring the different definitions of “Big Data” which have emerged over 
the last number of years to label data with different attributes. The Big Data Value 
Chain is introduced to describe the information flow within a big data system as a 
series of steps needed to generate value and useful insights from data. The chapter 
explores the concept of Ecosystems, its origins from the business community, and 
how it can be extended to the big data context. Key stakeholders of a big data 
ecosystem are identified together with the challenges that need to be overcome to 
enable a big data ecosystem in Europe.


3.2 What Is Big Data?


Over recent years, the term “Big Data” has been used by different major players to label
data with different attributes. Several definitions of big data have been proposed
over the last decade; see Table 3.1. The first definition, by Doug Laney of META
Group (later acquired by Gartner), defined big data using a three-dimensional
perspective: “Big data is high volume, high velocity, and/or high variety informa- 
tion assets that require new forms of processing to enable enhanced decision- 
making, insight discovery and process optimization” (Laney 2001). Loukides 
(2010) defines big data as “when the size of the data itself becomes part of the 
problem and traditional techniques for working with data run out of steam”. Jacobs 
(2009) describes big data as “data whose size forces us to look beyond the tried- 
and-true methods that are prevalent at that time”. 

Big data brings together a set of data management challenges for working with 
data under new scales of size and complexity. Many of these challenges are not 
new. What is new however are the challenges raised by the specific characteristics 
of big data related to the 3 Vs: 


• Volume (amount of data): dealing with large scales of data within data
processing (e.g. Global Supply Chains, Global Financial Analysis, Large Hadron
Collider).

• Velocity (speed of data): dealing with streams of high frequency of incoming
real-time data (e.g. Sensors, Pervasive Environments, Electronic Trading,
Internet of Things).

• Variety (range of data types/sources): dealing with data using differing
syntactic formats (e.g. Spreadsheets, XML, DBMS), schemas, and meanings
(e.g. Enterprise Data Integration).


The Vs of big data challenge the fundamentals of existing technical approaches 
and require new forms of data processing to enable enhanced decision-making, 
insight discovery, and process optimisation. As the big data field matured, other Vs 
have been added such as Veracity (documenting quality and uncertainty), Value, 
etc. The value of big data can be described in the context of the dynamics of 
knowledge-based organisations (Choo 1996), where the processes of decision- 
making and organisational action are dependent on the process of sense-making 
and knowledge creation. 


Table 3.1 Definitions of big data

Big data definition | Source
“Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization” | Laney (2001), Manyika et al. (2011)
“When the size of the data itself becomes part of the problem and traditional techniques for working with data run out of steam” | Loukides (2010)
Big Data is “data whose size forces us to look beyond the tried-and-true methods that are prevalent at that time” | Jacobs (2009)
“Big Data technologies [are] a new generation of technologies and architectures designed to extract value economically from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis” | IDC (2011)
“The term for a collection of datasets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications” | Wikipedia (2014)
“A collection of large and complex data sets which can be processed only with difficulty by using on-hand database management tools” | Mike 2.0 (2014)
“Big Data is a term encompassing the use of techniques to capture, process, analyse and visualize potentially large datasets in a reasonable timeframe not accessible to standard IT technologies.” By extension, the platform, tools and software used for this purpose are collectively called “Big Data technologies” | NESSI (2012)
“Big data can mean big volume, big velocity, or big variety” | Stonebraker (2012)


3.3 The Big Data Value Chain


Within the field of Business Management, Value Chains have been used as a
decision support tool to model the chain of activities that an organisation performs
in order to deliver a valuable product or service to the market (Porter 1985). The
value chain categorises the generic value-adding activities of an organisation,
allowing them to be understood and optimised. A value chain is made up of a series
of subsystems, each with inputs, transformation processes, and outputs. Rayport and
Sviokla (1995) were among the first to apply the value chain metaphor to
information systems within their work on Virtual Value Chains. As an analytical tool,
the value chain can be applied to information flows to understand the value creation
of data technology. In a Data Value Chain, information flow is described as a series
of steps needed to generate value and useful insights from data. The European
Commission sees the data value chain as the “centre of the future knowledge
economy, bringing the opportunities of the digital developments to the more
traditional sectors (e.g. transport, financial services, health, manufacturing, retail)”
(DG Connect 2013).

The Big Data Value Chain (Curry et al. 2014), as illustrated in Fig. 3.1, can be 
used to model the high-level activities that comprise an information system. The 
Big Data Value Chain identifies the following key high-level activities: 


Data Acquisition is the process of gathering, filtering, and cleaning data before it
is put in a data warehouse or any other storage solution on which data analysis can
be carried out. Data acquisition is one of the major big data challenges in terms of
infrastructure requirements. The infrastructure required to support the acquisition
of big data must deliver low, predictable latency in both capturing data and in
executing queries; be able to handle very high transaction volumes, often in a
distributed environment; and support flexible and dynamic data structures. Data
acquisition is further detailed in this chapter.

Fig. 3.1 The Big Data Value Chain as described within (Curry et al. 2014)


Data Analysis is concerned with making the raw data acquired amenable to use in 
decision-making as well as domain-specific usage. Data analysis involves explor- 
ing, transforming, and modelling data with the goal of highlighting relevant data, 
synthesising and extracting useful hidden information with high potential from a 
business point of view. Related areas include data mining, business intelligence, 
and machine learning. Chapter 4 covers data analysis. 


Data Curation is the active management of data over its life cycle to ensure it 
meets the necessary data quality requirements for its effective usage (Pennock 
2007). Data curation processes can be categorised into different activities such as 
content creation, selection, classification, transformation, validation, and
preservation. Data curation is performed by expert curators who are responsible for
improving the accessibility and quality of data. Data curators (also known as
scientific curators or data annotators) hold the responsibility of ensuring that data
are trustworthy, discoverable, accessible, reusable, and fit for their purpose. A key
trend in the curation of big data is the use of community and crowdsourcing
approaches (Curry et al. 2010). Further analysis of data curation techniques for big
data is provided in Chap. 5.


Data Storage is the persistence and management of data in a scalable way that
satisfies the needs of applications that require fast access to the data. Relational
Database Management Systems (RDBMS) have been the main, and almost the only,
solution to the storage paradigm for nearly 40 years. However, the ACID
(Atomicity, Consistency, Isolation, and Durability) properties that guarantee
database transactions come at the cost of flexibility with regard to schema changes,
and of performance and fault tolerance when data volumes and complexity grow,
making such systems unsuitable for big data scenarios. NoSQL technologies have been
designed with the scalability goal in mind and present a wide range of solutions
based on alternative data models. A more detailed discussion of data storage is
provided in Chap. 6.


Data Usage covers the data-driven business activities that need access to data, its
analysis, and the tools needed to integrate the data analysis within the business 
activity. Data usage in business decision-making can enhance competitiveness 
through reduction of costs, increased added value, or any other parameter that 
can be measured against existing performance criteria. Chapter 7 contains a 
detailed examination of data usage. 


3.4 Ecosystems 


The term ecosystem was coined by Tansley in 1935 to identify a basic ecological
unit comprising both the environment and the organisms that use it. Within the
context of business, James F. Moore (1993, 1996, 2006) exploited the biological 
metaphor and used the term to describe the business environment. Moore defined a 
business ecosystem as an “economic community supported by a foundation of 
interacting organizations and individuals” (Moore 1996). A strategy involving a 
company attempting to succeed alone has proven to be limited in terms of its 
capacity to create valuable products or services. It is crucial that businesses 
collaborate among themselves to survive within a business ecosystem (Moore 
1993; Gossain and Kandiah 1998). Ecosystems allow companies to create new 
value that no company could achieve by itself (Adner 2006). Within a healthy 
business ecosystem, companies can work together in a complex business web 
where they can easily exchange and share vital resources (Kim et al. 2010). 

The study of Business Ecosystems is an active area of research where 
researchers are investigating many facets of the business ecosystem metaphor to 
explore aspects such as community, cooperation, interdependency, co-evolution, 
eco-systemic functions, and boundaries of business environments. Koenig (2012)
provides a simple typology of Business Ecosystems based on the degree of key
resource control and type of member interdependence. Types of business ecosystems
include supply systems (e.g. Nike), platforms (e.g. Apple iTunes), communities of
destiny (e.g. Sematech in the semiconductor industry), and expanding communities.


3.4.1 Big Data Ecosystems 


In natural ecosystems, smart organisms control their energy. In business ecosys- 
tems, a smart company manages information and its flows (Kim et al. 2010). In 
terms of data, the ecosystem metaphor is useful to describe the data environment 
supported by a community of interacting organisations and individuals. Big Data 
Ecosystems can form in different ways around an organisation, community technology
platforms, or within or across sectors. Big Data Ecosystems exist within
many industrial sectors where vast amounts of data move between actors within
complex information supply chains. Sectors with established or emerging data
ecosystems include Healthcare, Finance (O’Riáin et al. 2012), Logistics, Media,
Manufacturing, and Pharmaceuticals (Curry et al. 2010). In addition to the data
itself, Big Data Ecosystems can also be supported by data management platforms,
data infrastructure (e.g. various Apache open source projects), and data services.


3.4.2 European Big Data Ecosystem 


While no coherent data ecosystem exists at the European level (DG Connect 2013),
the benefits of sharing and linking data across domains and industry sectors are
becoming obvious. Initiatives such as smart cities are showing how different sectors
(e.g. energy and transport) can collaborate to maximise the potential for
optimisation and value return. The cross-fertilisation of stakeholders and datasets
from different sectors is a key element for advancing the big data economy in Europe.

A European big data business ecosystem is an important factor for commercia- 
lisation and commoditisation of big data services, products, and platforms. A 
successful big data ecosystem would see all “stakeholders interact seamlessly 
within a Digital Single Market, leading to business opportunities, easier access to 
knowledge and capital” (European Commission 2014). 

A well-functioning data ecosystem must bring together the key stakeholders
with a clear benefit for all. The key actors in a big data ecosystem, as
illustrated in Fig. 3.2, are:


• Data Suppliers: Person or organisation [large enterprises and small and
  medium-sized enterprises (SMEs)] that creates, collects, aggregates, and
  transforms data from both public and private sources.

• Technology Providers: Typically organisations (large enterprises and SMEs) that
  provide tools, platforms, services, and know-how for data management.

• Data End Users: Person or organisation from different industrial sectors
  (private and public) that leverages big data technology and services to their
  advantage.

• Data Marketplace: Person or organisation that hosts data from publishers and
  offers it to consumers/end users.

• Start-ups and Entrepreneurs: Develop innovative data-driven technology,
  products, and services.

• Researchers and Academics: Investigate new algorithms, technologies,
  methodologies, business models, and societal aspects needed to advance big data.

• Regulators for data privacy and legal issues.

• Standardisation Bodies: Define technology standards (both official and de
  facto) to promote the global adoption of big data technology.

• Investors, Venture Capitalists, and Incubators: Person or organisation that
  provides resources and services to develop the commercial potential of the
  ecosystem.


Fig. 3.2 The Micro, Meso, and Macro Levels of a Big Data Ecosystem [adapted from Moore
(1996)]


3.4.3 Toward a Big Data Ecosystem 


Enabling a Europe-wide data ecosystem will require overcoming a number of technical
challenges associated with the cost and complexity of publishing and utilising
data. Current ecosystems face a number of problems such as data discovery,
curation, linking, synchronisation, distribution, business modelling, and
sales and marketing. A number of key societal and environmental challenges also
need to be overcome to establish effective big data ecosystems; these include but
are not limited to:


• Understanding the value and contribution of big data technology

• Determining the value of data

• Identification of business models that will support a data-driven ecosystem

• Enabling entrepreneurs and venture capitalists to easily access the ecosystem

• Preservation of privacy and security for all actors in the ecosystem

• Reducing fragmentation of languages, intellectual property rights, laws, and
  policy practices between EU countries


3.5 Summary


Big data is the emerging field where innovative technology offers new ways to 
extract value from the tsunami of available information. As with any emerging area, 
terms and concepts can be open to different interpretations. The Big Data domain is 
no different. The different definitions of “Big Data” which have emerged show the 
diversity and use of the term to label data with different attributes. Two tools from 
the business community, Value Chains and Business Ecosystems, can be used to 
model big data systems and the big data business environments. Big Data Value 
Chains can describe the information flow within a big data system as a series of 
steps needed to generate value and useful insights from data. Big Data Ecosystems 
can be used to understand the business context and relationships between key 
stakeholders. A European big data business ecosystem is an important factor for 
commercialisation and commoditisation of big data services, products, and 
platforms. 


Chapter 4
Big Data Acquisition 


Klaus Lyko, Marcus Nitzschke, and Axel-Cyrille Ngonga Ngomo 


4.1 Introduction 


In recent years, the term big data has been used by different major players to label
data with different attributes. Moreover, different data processing architectures for
big data have been proposed to address the different characteristics of big data.
Overall, data acquisition has been understood as the process of gathering, filtering,
and cleaning data before the data is put in a data warehouse or any other storage
solution.

The position of big data acquisition within the overall big data value chain can be 
seen in Fig. 4.1. The acquisition of big data is most commonly governed by four of 
the Vs: volume, velocity, variety, and value. Most data acquisition scenarios 
assume high-volume, high-velocity, high-variety, but low-value data, making it 
important to have adaptable and time-efficient gathering, filtering, and cleaning 
algorithms that ensure that only the high-value fragments of the data are actually 
processed by the data-warehouse analysis. However, for some organizations, most
data is of potentially high value, as it can help them to recruit new customers. For
such organizations, data analysis, classification, and packaging on very high data
volumes play the most central role after the data acquisition.
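
The gathering, filtering, and cleaning steps described above can be sketched as a chain of small functions. This is a minimal illustration only: the record layout, the `value_score` field, and the threshold are hypothetical and stand in for whatever value estimate a real acquisition pipeline would use.

```python
from typing import Dict, Iterable, Iterator, List

def gather(sources: Iterable[Iterable[Dict]]) -> Iterator[Dict]:
    """Gather: merge records from several (hypothetical) sources into one stream."""
    for source in sources:
        yield from source

def keep_high_value(records: Iterable[Dict], threshold: float) -> Iterator[Dict]:
    """Filter: drop records whose illustrative value score is below the threshold."""
    return (r for r in records if r.get("value_score", 0.0) >= threshold)

def clean(records: Iterable[Dict]) -> Iterator[Dict]:
    """Clean: normalise the text field and discard records without an identifier."""
    for r in records:
        if r.get("id") is None:
            continue
        yield {**r, "text": str(r.get("text", "")).strip().lower()}

def acquire(sources: Iterable[Iterable[Dict]], threshold: float = 0.5) -> List[Dict]:
    """Run gather -> filter -> clean and return what would be handed to storage."""
    return list(clean(keep_high_value(gather(sources), threshold)))
```

Because each step is a generator, records flow through one at a time, so low-value records are discarded before they reach the storage or analysis stage.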

The goals of this chapter are threefold. First, it aims to identify the present
general requirements for data acquisition by presenting open state-of-the-art
frameworks and protocols for big data acquisition for companies. Second, it aims
to unveil the current approaches used for data acquisition in the different sectors.
Finally, it discusses how the requirements for data acquisition are met by current
approaches as well as possible future developments in the same area.


K. Lyko (✉) • M. Nitzschke • A.-C. Ngonga Ngomo
University of Leipzig, Augustusplatz 10, 04109 Leipzig, Germany
e-mail: lyko@informatik.uni-leipzig.de; nitzschke@informatik.uni-leipzig.de;
ngonga@informatik.uni-leipzig.de

© The Author(s) 2016
J.M. Cavanillas et al. (eds.), New Horizons for a Data-Driven Economy,
DOI 10.1007/978-3-319-21569-3_4


Fig. 4.1 Data acquisition in the big data value chain


4.2 Key Insights for Big Data Acquisition 


To get a better understanding of data acquisition, the chapter first takes a look at
the different big data architectures of Oracle, Vivisimo, and IBM. This situates
the process of acquisition within the big data processing pipeline.

The big data processing pipeline has been abstracted in numerous ways in 
previous works. Oracle (2012) relies on a three-step approach for data processing. 
In the first step, the content of different data sources is retrieved and stored within a 
scalable storage solution such as a NoSQL database or the Hadoop Distributed File 
System (HDFS). The stored data is subsequently processed by first being 
reorganized and stored in an SQL-capable big data analytics software and finally 
analysed by using big data analytics algorithms. 

Velocity (Vivisimo 2012) relies on a different view on big data. Here, the 
approach is more search-oriented. The main component of the architecture is a 
connector layer, in which different data sources can be addressed. The content of 
these data sources is gathered in parallel, converted, and finally added to an index, 
which builds the basis for data analytics, business intelligence, and all other data- 
driven applications. Other big players such as IBM rely on architectures similar to 
Oracle’s (IBM 2013). 
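
The search-oriented flow described above (gather from connectors in parallel, convert, add to an index) can be sketched as follows. This is not Vivisimo's implementation: the connector signature, the tokenisation rule, and the in-memory inverted index are assumptions chosen to keep the example self-contained.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Set, Tuple

# A connector returns (document id, raw text) pairs; real connectors would
# wrap databases, file shares, web services, and so on (hypothetical here).
Connector = Callable[[], List[Tuple[str, str]]]

def convert(raw: str) -> List[str]:
    """Convert raw content into lower-case index terms without punctuation."""
    terms = (token.strip(".,;:!?").lower() for token in raw.split())
    return [term for term in terms if term]

def build_index(connectors: List[Connector]) -> Dict[str, Set[str]]:
    """Gather from all connectors in parallel, convert, and add to an inverted index."""
    index: Dict[str, Set[str]] = defaultdict(set)
    with ThreadPoolExecutor(max_workers=max(1, len(connectors))) as pool:
        # pool.map runs the connectors concurrently and yields results in input order.
        for documents in pool.map(lambda connector: connector(), connectors):
            for doc_id, raw in documents:
                for term in convert(raw):
                    index[term].add(doc_id)
    return index
```

The index maps each term to the set of documents containing it, which is the structure on which search and downstream analytics would then operate.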

Across the different architectures for big data processing, the core of data
acquisition boils down to gathering data from distributed information sources with
the aim of storing them in scalable, big data-capable data storage. To achieve this
goal, three main components are required:


1. Protocols that allow the gathering of information from distributed data sources of
   any type (unstructured, semi-structured, structured)

2. Frameworks with which the data is collected from the distributed sources by
   using different protocols

3. Technologies that allow the persistent storage of the data retrieved by the
   frameworks


4.3 Social and Economic Impact of Big Data Acquisition 


Over recent years, the amount of data being produced has increased steadily.
Ninety percent of the data in the world today was produced over the last
2 years. The sources and nature of this data are diverse. It ranges from data gathered
by sensors to data depicting (online) transactions. An ever-increasing part is
produced in social media and via mobile devices. The type of data (structured
vs. unstructured) and semantics are also diverse. Yet, all this data must be
aggregated to help answer business questions and form a broad picture of the market.

For businesses, this trend presents both opportunities and challenges in
creating new business models and improving current operations, thereby generating
market advantages. Tools and methods to deal with big data driven by the four Vs
can be used for improved user-specific advertisement or market research in general.
For example, smart metering systems are being tested in the energy sector.
Furthermore, in combination with new billing systems, these systems could also be
beneficial in other sectors such as telecommunication and transport.

Big data has already influenced many businesses and has the potential to impact 
all business sectors. While there are several technical challenges, the impact on 
management and decision-making and even company culture will be no less great 
(McAfee and Brynjolfsson 2012). 

Several barriers remain, however. In particular, privacy and security concerns
need to be addressed by these systems and technologies. Many systems already
generate and collect large amounts of data, but only a small fragment is used 
actively in business processes. In addition, many of these systems lack real-time 
requirements. 


4.4 Big Data Acquisition: State of the Art 


The bulk of big data acquisition is carried out within the message queuing para- 
digm, sometimes also called the streaming paradigm, publish/subscribe paradigm 
(Carzaniga et al. 2000), or event processing paradigm (Cugola and Margara 2012; 
Luckham 2002). Here, the basic assumption is that manifold volatile data sources 
generate information that needs to be captured, stored, and analysed by a big data 
processing platform. The new information generated by the data source is 
forwarded to the data storage by means of a data acquisition framework that 
implements a predefined protocol. This section describes the two core technologies 
for acquiring big data. 


4.4.1 Protocols


Several of the organizations that rely internally on big data processing have devised 
enterprise-specific protocols of which most have not been publicly released and can 
thus not be described in this chapter. This section presents the commonly used open 
protocols for data acquisition. 


4.4.1.1 AMQP 


The Advanced Message Queuing Protocol (AMQP) was developed in response to
the need for an open protocol that would satisfy the requirements of large
companies with respect to data acquisition. To achieve this goal, 23 companies
compiled a set of requirements for a data acquisition protocol. The resulting
AMQP became an OASIS standard in October 2012. The rationale behind AMQP
(Bank of America et al. 2011) was to provide a protocol with the following
characteristics:


• Ubiquity: This property of AMQP refers to its ability to be used across different
  industries within both current and future data acquisition architectures. AMQP’s
  ubiquity was achieved by making it easily extensible and simple to implement.
  The large number of frameworks that implement it, including SwiftMQ,
  Microsoft Windows Azure Service Bus, Apache Qpid, and Apache ActiveMQ,
  reflects how easy the protocol is to implement.

• Safety: The safety property was implemented across two different dimensions.
  First, the protocol allows the integration of message encryption to ensure that
  even intercepted messages cannot be decoded easily. Thus, it can be used to
  transfer business-critical information. The protocol is robust against the
  injection of spam, making AMQP brokers difficult to attack. Second, AMQP
  ensures the durability of messages, meaning that it allows messages to be
  transferred even when the sender and receiver are not online at the same time.

• Fidelity: This third characteristic is concerned with the integrity of the message.
  AMQP includes means to ensure that the sender can express the semantics of the
  message and thus allow the receiver to understand what it is receiving. The
  protocol implements reliable failure semantics that allow systems to detect
  errors occurring between the creation of the message at the sender’s end and the
  storage of the information by the receiver.

• Applicability: The intention behind this property is to ensure that AMQP clients
  and brokers can communicate by using several of the protocols of the Open
  Systems Interconnection (OSI) model layers such as Transmission Control
  Protocol (TCP), User Datagram Protocol (UDP), and also Stream Control
  Transmission Protocol (SCTP). By these means, AMQP is applicable in many
  scenarios and industries where not all the protocols of the OSI model layers are
  required and used. Moreover, the protocol was designed to support different
  messaging patterns including direct messaging, request/reply, publish/subscribe,
  etc.

• Interoperability: The protocol was designed to be independent of particular
  implementations and vendors. Thus, clients and brokers with fully independent
  implementations, architectures, and ownership can interact by means of AMQP.
  As stated above, several frameworks from different organizations now implement
  the protocol.

• Manageability: One of the main concerns during the specification of AMQP
  was to ensure that frameworks that implement it could scale easily. This was
  achieved by ensuring that AMQP is a fault-tolerant and lossless wire protocol
  through which information of all types (e.g. XML, audio, video) can be
  transferred.


To implement these requirements, AMQP relies on a type system and four 
different layers: a transport layer, a messaging layer, a transaction layer, and a 
security layer. The type system is based on primitive types from databases (integers, 
strings, symbols, etc.), described types as known from programming, and descriptor 
values that can be extended by the users of the protocol. In addition, AMQP allows 
the use of encoding to store symbols and values as well as the definition of 
compound types that consist of combinations of several primary types. 

The transport layer defines how AMQP messages are to be processed. An AMQP 
network consists of nodes that are connected via links. Nodes can originate messages
(senders), forward them (relays), or consume them (receivers). Messages
are only allowed to travel across a link when this link abides by the criteria defined 
by the source of the message. The transport layer supports several types of route 
exchanges including message fanout and topic exchange. 
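
The difference between fanout and topic routing can be illustrated with a short sketch. The pattern syntax used here (dot-separated words, `*` for exactly one word, `#` for zero or more words) follows the convention of AMQP 0-9-1 brokers such as RabbitMQ and is an assumption; the function names and the queue bindings are hypothetical and do not belong to any client library.

```python
from typing import Dict, List

def topic_matches(pattern: str, routing_key: str) -> bool:
    """Match a dot-separated routing key: '*' is one word, '#' is zero or more words."""
    def match(p: List[str], k: List[str]) -> bool:
        if not p:
            return not k
        if p[0] == "#":
            # '#' may absorb nothing, or absorb one word and stay in play.
            return match(p[1:], k) or (bool(k) and match(p, k[1:]))
        if not k:
            return False
        return (p[0] == "*" or p[0] == k[0]) and match(p[1:], k[1:])
    return match(pattern.split("."), routing_key.split("."))

def route_fanout(bindings: Dict[str, str], routing_key: str) -> List[str]:
    """Fanout: every bound queue receives the message; the routing key is ignored."""
    return sorted(bindings)

def route_topic(bindings: Dict[str, str], routing_key: str) -> List[str]:
    """Topic: only queues whose binding pattern matches the routing key receive it."""
    return sorted(q for q, pattern in bindings.items() if topic_matches(pattern, routing_key))
```

Here `bindings` maps a queue name to its binding pattern. With a fanout exchange the pattern plays no role, whereas a topic exchange lets each consumer subscribe to a slice of the message space.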

The messaging layer of AMQP describes the structure of valid messages. A bare 
message is a message as submitted by the sender to an AMQP network. 

The transaction layer allows for the “coordinated outcome of otherwise
independent transfers” (Bank of America et al. 2011, p. 95). The basic idea of the
transactional messaging approach followed by this layer is that the sender of the
message acts as a controller while the receiver acts as a resource, with messages
being transferred as specified by the controller. By these means,
decentralized and scalable message processing can be achieved.
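
The controller/resource split can be sketched in a few lines. The method names `declare` and `discharge` borrow the vocabulary of the AMQP 1.0 specification, but the class itself is a hypothetical in-memory stand-in, not the wire protocol or any client API.

```python
import itertools
from typing import Dict, List

class TransactionalResource:
    """Receiver side: stages transfers per transaction until the controller settles it."""

    def __init__(self) -> None:
        self.delivered: List[str] = []            # messages visible to consumers
        self._staged: Dict[int, List[str]] = {}   # transaction id -> uncommitted messages
        self._ids = itertools.count(1)

    def declare(self) -> int:
        """Open a transaction on behalf of the controller and return its identifier."""
        txn_id = next(self._ids)
        self._staged[txn_id] = []
        return txn_id

    def transfer(self, txn_id: int, message: str) -> None:
        """Accept a message provisionally; it is not delivered until discharge."""
        self._staged[txn_id].append(message)

    def discharge(self, txn_id: int, fail: bool = False) -> None:
        """Settle the transaction: deliver all staged messages, or drop them if fail is set."""
        staged = self._staged.pop(txn_id)
        if not fail:
            self.delivered.extend(staged)
```

The sender (controller) decides the outcome of a group of otherwise independent transfers, while the receiver (resource) only tracks which messages belong to which open transaction.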

The final AMQP layer is the security layer, which enables the definition of 
means to encrypt the content of AMQP messages. The protocols for achieving this 
goal are supposed to be defined externally from AMQP itself. Protocols that can be 
used to this end include transport layer security (TLS) and simple authentication
and security layer (SASL).

Due to its adoption across several industries and its high flexibility, it is likely 
that AMQP will become the standard approach for message processing in industries 
that cannot afford to implement their own dedicated protocols. With the upcoming 
data-as-a-service industry, it also promises to be the go-to solution for 
implementing services around data streams. One of the most commonly used 
AMQP brokers is RabbitMQ, whose popularity is mostly due to the fact that it 
implements several messaging protocols including JMS. 


4.4.1.2 Java Message Service


The Java Message Service (JMS) API was included in the Java 2 Enterprise Edition on
18 March 2002, after the Java Community Process ratified its final version 1.1
as a standard.

According to the 1.1 specification JMS “provides a common way for Java 
programs to create, send, receive and read an enterprise messaging system’s 
messages”. Administrative tools allow one to bind destinations and connection 
factories into a Java Naming and Directory Interface (JNDI) namespace. A JMS 
client can then use resource injection to access the administered objects in the 
namespace and then establish a logical connection to the same objects through the 
JMS provider. 

The JNDI serves in this case as the moderator between different clients who
want to exchange messages. Note that the term “client” is used here (as the spec
does) to denote the sender as well as the receiver of a message, because JMS was
originally designed to exchange messages peer-to-peer. Currently, JMS offers two
messaging models: point-to-point and publish/subscribe, where the latter is a
one-to-many connection.
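
The two messaging models differ in who receives a message. The sketch below illustrates the semantics in plain Python rather than through the JMS API; the class and method names are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List

class PointToPointQueue:
    """Point-to-point: each message is consumed by exactly one receiver."""

    def __init__(self) -> None:
        self._messages: Deque[str] = deque()

    def send(self, message: str) -> None:
        self._messages.append(message)

    def receive(self) -> str:
        # Removing the message guarantees that no other receiver sees it.
        return self._messages.popleft()

class Topic:
    """Publish/subscribe: each message is delivered to every current subscriber."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, message: str) -> None:
        for callback in self._subscribers:
            callback(message)
```

A queue suits work distribution, where one consumer should handle each item, while a topic suits broadcasting the same event to several independent consumers.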

AMQP is compatible with JMS, which is the de facto standard for message
passing in the Java world. While AMQP is defined at the format level (i.e. byte
stream of octets), JMS is standardized at the API level and is therefore not easy to
implement in other programming languages (as the “J” in “JMS” suggests). In
addition, JMS does not provide functionality for load balancing/fault tolerance,
error/advisory notification, administration of services, security, wire protocol, or
message type repository (database access).

A considerable advantage of AMQP is, however, the programming language
independence of the implementation, which avoids vendor lock-in and platform
compatibility problems.


4.4.2 Software Tools 


Many software tools for data acquisition are well known, and numerous use cases are
documented on the web, so getting started with them is feasible. Nevertheless, using
each tool correctly requires a deep knowledge of its internal workings and
implementation. Different paradigms of data acquisition have emerged depending on
the scope these tools focus on. The architectural diagram in Fig. 4.2 shows an overall
picture of the complete big data workflow, highlighting the data acquisition part.

In the remainder of this section, these tools and others relating to data acquisition
are described in detail.


Fig. 4.2 Big data workflow


4.4.2.1 Storm 


Storm is an open-source framework for the robust distributed real-time computation 
on streams of data. It started off as an open-source project and now has a large and 
active community. Storm supports a wide range of programming languages and 
storage facilities (relational databases, NoSQL stores, etc.). One of the main 
advantages of Storm is that it can be utilized in many data gathering scenarios 
including stream processing and distributed RPC for solving computationally 
intensive functions on-the-fly, and continuous computation applications (Gabriel 
2012). Many companies and applications are using Storm to power a wide variety 
of production systems processing data, including Groupon, The Weather Channel, 
fullcontact.com, and Twitter. 

The logical network of Storm consists of three types of nodes: a master node 
called Nimbus, a set of intermediate Zookeeper nodes, and a set of Supervisor nodes. 


• The Nimbus: is equivalent to Hadoop’s JobTracker; it uploads the computation
  for execution, distributes code across the cluster, and monitors computation.

• The Zookeepers: handle the complete cluster coordination. This cluster
  organization layer is based upon the Apache ZooKeeper project.

• The Supervisor Daemon: spawns worker nodes; it is comparable to Hadoop’s
  TaskTracker. This is where most of the work of application developers goes.
  The worker nodes communicate with the Nimbus via the Zookeepers
  to determine what to run on the machine, starting and stopping workers.


A computation is called a topology in Storm. Once deployed, topologies run
indefinitely. There are four concepts and abstraction layers within Storm:


• Streams: unbounded sequences of tuples, which are named lists of values.
  Values can be arbitrary objects implementing a serialization interface.

• Spouts: are sources of streams in a computation, e.g. readers for data sources
  such as the Twitter Streaming APIs.

• Bolts: process any number of input streams and produce any number of output
  streams. This is where most of the application logic goes.

• Topologies: are the top-level abstractions of Storm. Basically, a topology is a
  network of spouts and bolts connected by edges. Every edge is a bolt subscribing
  to the stream of a spout or another bolt.
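
The four concepts above can be sketched with plain Python generators. This is an illustration of the dataflow only and does not use Storm's API: the function names are hypothetical, the stream is finite rather than unbounded, and everything runs in a single process.

```python
from typing import Iterable, Iterator, List, Tuple

def tweet_spout(tweets: Iterable[str]) -> Iterator[Tuple[str]]:
    """Spout: the source of a stream; emits one-field tuples."""
    for tweet in tweets:
        yield (tweet,)

def split_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt: consumes the tweet stream and emits one tuple per word."""
    for (tweet,) in stream:
        for word in tweet.split():
            yield (word,)

def hashtag_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt: consumes the word stream and emits only lower-cased hashtags."""
    for (word,) in stream:
        if word.startswith("#"):
            yield (word.lower(),)

def run_topology(tweets: Iterable[str]) -> List[str]:
    """Topology: spout -> split bolt -> hashtag bolt; each arrow is a subscription."""
    return [tag for (tag,) in hashtag_bolt(split_bolt(tweet_spout(tweets)))]
```

Each bolt only knows the stream it subscribes to, which is what allows a real cluster to place spouts and bolts on different machines.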


Both spouts and bolts are stateless nodes and inherently parallel, executing as 
many tasks across the cluster. From a physical point of view a worker is a Java 
Virtual Machine (JVM) process with a number of tasks running within. Both spouts 
and bolts are distributed over a number of tasks and workers. Storm supports a
number of stream grouping approaches, ranging from random (shuffle) grouping,
which distributes tuples randomly across tasks, to fields grouping, where tuples
with the same values in specific fields are routed to the same task
(Madsen 2012).
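
The relationship between spouts, bolts, and fields grouping can be illustrated
with a minimal in-process sketch. It does not use the Storm API: the function
names, the fixed input sentences, and the use of CRC32 as the hash are
assumptions made purely for illustration.

```python
import zlib
from collections import Counter


def sentence_spout():
    """Spout: the source of the stream (a fixed list instead of a live feed)."""
    for sentence in ["storm processes streams", "storm runs topologies"]:
        yield (sentence,)


def split_bolt(tup):
    """Bolt: emits one (word,) tuple for every word of the incoming sentence."""
    return [(word,) for word in tup[0].split()]


def fields_grouping(tup, num_tasks):
    """Pick a task by hashing the first field, so equal values share a task."""
    return zlib.crc32(tup[0].encode("utf-8")) % num_tasks


def run_topology(num_tasks=3):
    # One Counter per task records which words that task received.
    received = [Counter() for _ in range(num_tasks)]
    for sentence in sentence_spout():
        for word_tuple in split_bolt(sentence):
            task = fields_grouping(word_tuple, num_tasks)
            received[task][word_tuple[0]] += 1
    return received


tasks = run_topology()
for index, words in enumerate(tasks):
    print(index, dict(words))
```

Because the task is derived from the field value, both occurrences of "storm"
reach the same task, which is what makes per-key aggregation possible
downstream.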

Storm uses a pull model; each bolt pulls events from its source. Tuples traverse 
the entire network within a specified time window or are considered as failed. 
Therefore, in terms of recovery, the spouts are responsible for keeping tuples
ready for replay.
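
The replay responsibility of a spout can be sketched as follows. This is a
simplified model under stated assumptions, not Storm's implementation: the
class `ReplayingSpout`, its method names, and the use of a logical clock passed
in as `now` are all hypothetical.

```python
class ReplayingSpout:
    """Keeps every emitted tuple until it is acked or its timeout expires."""

    def __init__(self, items, timeout):
        self.queue = list(items)   # tuples waiting to be emitted or replayed
        self.pending = {}          # tuple id -> (tuple, emit time)
        self.timeout = timeout
        self.next_id = 0

    def next_tuple(self, now):
        """Emit the next tuple and remember it until it is acknowledged."""
        if not self.queue:
            return None
        tup = self.queue.pop(0)
        tuple_id = self.next_id
        self.next_id += 1
        self.pending[tuple_id] = (tup, now)
        return tuple_id, tup

    def ack(self, tuple_id):
        """The tuple traversed the whole topology; it can be forgotten."""
        self.pending.pop(tuple_id, None)

    def expire(self, now):
        """Treat tuples older than the timeout as failed and queue a replay."""
        for tuple_id, (tup, emitted) in list(self.pending.items()):
            if now - emitted > self.timeout:
                del self.pending[tuple_id]
                self.queue.append(tup)


spout = ReplayingSpout(["a", "b"], timeout=30)
id_a, _ = spout.next_tuple(now=0)
id_b, _ = spout.next_tuple(now=0)
spout.ack(id_a)        # "a" was fully processed in time
spout.expire(now=31)   # "b" was not, so it is scheduled for replay
replayed = spout.next_tuple(now=31)
print(replayed)
```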


4.4.2.2 S4 


S4 (simply scalable streaming system) is a distributed, general-purpose platform for 
developing applications that process streams of data. It was started in 2008 by
Yahoo! Inc. and has been an Apache Incubator project since 2011. S4 is designed to work on
commodity hardware, avoiding I/O bottlenecks by relying on an all-in-memory 
approach (Neumeyer 2011). 

In general, keyed data events are routed to processing elements (PEs). PEs receive
events and emit resulting events and/or publish results. The S4 engine was
inspired by the MapReduce model and resembles the Actors model (encapsulation
semantics and location transparency). Among other things, it provides a simple
programming interface for processing data streams in a decentralized, symmetric,
and pluggable architecture.

A stream in S4 is a sequence of elements (events), each consisting of tuple-valued
keys and attributes. A PE, the basic computational unit, is identified by the
following four components: (1) its functionality provided by the PE class and
associated configuration, (2) the event types it consumes, (3) the keyed attribute
in these events, and (4) the value of the keyed attribute of the consumed events.
A PE is instantiated by the platform for each value of the keyed attribute. Keyless
PEs are a special class of PEs with no keyed attribute and value. These PEs consume
all events of the corresponding type and are typically at the input layer of an S4
cluster. A large number of standard PEs are available for typical tasks such as
aggregate and join. The logical hosts of PEs are the processing nodes (PNs). PNs
listen to events, execute operations for incoming events, and dispatch events with
the assistance of the communication layer.

S4 routes each event to PNs based on a hash function over all known values of
the keyed attribute in the event. There is another special type of PE object: the PE
prototype. It is identified by the first three components only. A prototype is
configured upon initialization and can clone itself for any key value to create a
fully qualified PE. This cloning is triggered by the PN for each unique value of the
keyed attribute. An S4 application is a graph composed of PE prototypes and streams
that produce, consume, and transmit messages, whereas PE instances are clones of the
corresponding prototypes containing the state and are associated with unique keys
(Neumeyer et al. 2011).
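
The interplay of hash-based routing, PE prototypes, and per-key PE instances
can be sketched as below. The sketch does not use the S4 API: `CounterPE`,
`ProcessingNode`, `route`, and the CRC32 hash are hypothetical names and
choices made for illustration only.

```python
import zlib


class CounterPE:
    """Hypothetical PE that counts the events seen for one key value."""

    def __init__(self, key_value=None):
        self.key_value = key_value   # None marks the prototype
        self.count = 0               # state is private to this PE instance

    def clone_for(self, key_value):
        """Prototype behaviour: create a fully qualified PE for one key value."""
        return CounterPE(key_value)

    def process(self, event):
        self.count += 1


class ProcessingNode:
    """Hosts PE instances and clones the prototype for each new key value."""

    def __init__(self, prototype, keyed_attribute):
        self.prototype = prototype
        self.keyed_attribute = keyed_attribute
        self.instances = {}

    def dispatch(self, event):
        key_value = event[self.keyed_attribute]
        if key_value not in self.instances:
            self.instances[key_value] = self.prototype.clone_for(key_value)
        self.instances[key_value].process(event)


def route(event, keyed_attribute, nodes):
    """Choose the PN by hashing the value of the keyed attribute."""
    digest = zlib.crc32(str(event[keyed_attribute]).encode("utf-8"))
    return nodes[digest % len(nodes)]


nodes = [ProcessingNode(CounterPE(), "word") for _ in range(2)]
for word in ["big", "data", "big"]:
    event = {"word": word}
    route(event, "word", nodes).dispatch(event)

counts = {key: pe.count for node in nodes for key, pe in node.instances.items()}
print(counts)
```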

As a consequence of this design S4 guarantees that all events with a specific 
value of the keyed attribute arrive at the corresponding PN and within it are routed 
to the specific PE instance (Bradic 2011). The current state of a PE is inaccessible to 
other PEs. S4 is based upon a push model: events are routed to the next PE as fast as 
possible. Therefore, if a receiver buffer fills up events may be dropped. Via lossy 
checkpointing S4 provides state recovery. In the case of a node crash a new one 
takes over its task from the most recent snapshot. The communication layer is based 
upon the Apache ZooKeeper project. It manages the cluster and provides failover 
handling to stand-by nodes. PEs are built in Java using a fairly simple API and are 
assembled into the application using the Spring framework. 


4.4.2.3 Kafka 


Kafka is a distributed publish-subscribe messaging system designed to support
mainly persistent messaging with high throughput. Kafka aims to unify offline
and online processing by providing a mechanism for a parallel load into Hadoop 
as well as the ability to partition real-time consumption over a cluster of machines. 
The use for activity stream processing makes Kafka comparable to Apache Flume, 
though the architecture and primitives are very different and make Kafka more 
comparable to a traditional messaging system. 

Kafka was originally developed at LinkedIn for tracking the huge volume of 
activity events generated by the website. These activity events are critical for 
monitoring user engagement as well as improving relevancy in their data-driven 
products. The previous diagram gives a simplified view of the deployment topology 
at LinkedIn. 

Note that a single Kafka cluster handles all activity data from different sources. 
This provides a single pipeline of data for both online and offline consumers. This 
tier acts as a buffer between live activity and asynchronous processing. Kafka can 
also be used to replicate all data to a different data centre for offline consumption. 

Kafka can be used to feed Hadoop for offline analytics, as well as a way to track 
internal operational metrics that feed graphs in real time. In this context, a very 
appropriate use for Kafka and its publish-subscribe mechanism would be
processing related stream data, from tracking user actions on large-scale websites to
relevance and ranking tasks. 

In Kafka, each stream is called a “topic”. Topics are partitioned for scaling
purposes. Producers of messages provide a key, which is used to determine the
partition the message is sent to. Thus, all messages with the same key are
guaranteed to be in the same topic partition. Each Kafka broker handles a subset
of the partitions, receiving and storing the messages sent by producers.

Kafka consumers read from a topic by getting messages from all partitions of the
topic. If a consumer wants to read all messages with a specific key (e.g. a user ID
in the case of website clicks), it only has to read messages from the partition the
key is on, not the complete topic. Furthermore, it is possible to reference any
point in a broker’s log file using an offset. This offset determines where a
consumer is in a specific topic/partition pair. The offset advances as the consumer
reads from the topic/partition pair.
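
Key-based partitioning and consumer offsets can be modelled in a few lines.
This is an in-memory sketch, not the Kafka client API: the classes `Topic` and
`Consumer`, their methods, and the CRC32 partitioner are assumptions for
illustration (Kafka's own partitioner uses a different hash function).

```python
import zlib


class Topic:
    """A topic split into append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # The same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def send(self, key, value):
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        return partition


class Consumer:
    """Tracks one offset per partition: the next position to read."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = {}

    def poll(self, partition, max_messages=10):
        start = self.offsets.get(partition, 0)
        batch = self.topic.partitions[partition][start:start + max_messages]
        self.offsets[partition] = start + len(batch)
        return batch

    def seek(self, partition, offset):
        """Reference any earlier point in the log, e.g. to re-read messages."""
        self.offsets[partition] = offset


clicks = Topic(num_partitions=4)
for page in ["/home", "/cart", "/pay"]:
    clicks.send("user-42", page)
clicks.send("user-7", "/home")

p = clicks.partition_for("user-42")
consumer = Consumer(clicks)
batch = consumer.poll(p)    # only the partition holding user-42 is read
consumer.seek(p, 0)         # rewind to the start of the partition
replay = consumer.poll(p)
print([value for key, value in batch if key == "user-42"])
```

Other keys may share the partition, which is why the usage example still
filters by key after reading.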

Kafka provides an at-least-once messaging guarantee and highly available
partitions. To store and cache messages, Kafka relies on the file system: all
data is written immediately to a persistent log without necessarily being flushed
to disk. In addition, the protocol is built upon a message set abstraction, which
groups messages together. This amortizes the network overhead and allows larger
sequential disk operations. Both consumer and producer share the same message format.


4.4.2.4 Flume 


Flume is a service for efficiently collecting and moving large amounts of log data. It 
has a simple and flexible architecture based on streaming data flows. It is robust and 
fault tolerant with tuneable reliability mechanisms and many failover and recovery 
mechanisms. It uses a simple extensible data model that allows for online analytic
applications. The system was designed with these four key goals in mind:
reliability, scalability, manageability, and extensibility.

The purpose of Flume is to provide a distributed, reliable, and available system 
for efficiently collecting, aggregating, and moving large amounts of log data from 
many different sources to a centralized data store. The architecture of Flume NG is 
based on a few concepts that together help achieve this objective: 


• Event: a byte payload with optional string headers that represent the unit of data
that Flume can transport from its point of origin to its final destination.

• Flow: movement of events from the point of origin to their final destination is
considered a data flow, or simply flow.

• Client: an interface implementation that operates at the point of origin of events
and delivers them to a Flume agent.

• Agent: an independent process that hosts Flume components such as sources,
channels, and sinks, and thus has the ability to receive, store, and forward events
to their next-hop destination.

• Source: an interface implementation that can consume events delivered to it via
a specific mechanism.

• Channel: a transient store for events, where events are delivered to the channel
via sources operating within the agent. An event put in a channel stays in that
channel until a sink removes it for further transport.

• Sink: an interface implementation that can remove events from a channel and
transmit them to the next agent in the flow, or to the event’s final destination.


These concepts help in simplifying the architecture, implementation, configura- 
tion, and deployment of Flume. 

A flow in Flume NG starts from the client. The client transmits the event to its 
next-hop destination. This destination is an agent. More precisely, the destination is 
a source operating within the agent. The source receiving this event will then 
deliver it to one or more channels. The channels that receive the event are drained 
by one or more sinks operating within the same agent. If the sink is a regular sink, it 
will forward the event to its next-hop destination, which will be another agent. If 
instead it is a terminal sink, it will forward the event to its final destination. 
Channels allow for the decoupling of sources from sinks using the familiar 
producer-consumer model of data exchange. This allows sources and sinks to 
have different performance and runtime characteristics and yet be able to effec- 
tively use the physical resources available to the system. 
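
The flow described above (client, source, channel, sink, next hop) can be
sketched with two chained agents. This is not the Flume API: the classes
`Event`, `Channel`, and `Agent`, and the idea of modelling the next hop as a
plain callable, are assumptions made for illustration.

```python
from collections import deque


class Event:
    """A byte payload with optional string headers."""

    def __init__(self, body, headers=None):
        self.body = body
        self.headers = headers or {}


class Channel:
    """Transient store that decouples the source from the sink."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        return self._events.popleft() if self._events else None


class Agent:
    """Hosts one source, one channel, and one sink."""

    def __init__(self, channel, next_hop):
        self.channel = channel
        self.next_hop = next_hop   # another agent's source or a final store

    def source_receive(self, event):
        # The source delivers the incoming event to the channel.
        self.channel.put(event)

    def sink_drain(self):
        # The sink removes events from the channel and forwards them.
        delivered = 0
        event = self.channel.take()
        while event is not None:
            self.next_hop(event)
            delivered += 1
            event = self.channel.take()
        return delivered


store = []                                                   # final destination
collector = Agent(Channel(), next_hop=store.append)          # terminal sink
edge = Agent(Channel(), next_hop=collector.source_receive)   # regular sink

# The client delivers an event to the source of the first agent.
edge.source_receive(Event(b"GET /index.html", {"host": "web-1"}))
edge.sink_drain()
collector.sink_drain()
print(len(store), store[0].headers)
```

Because each sink drains its channel independently of the source that fills
it, the two sides can run at different speeds, which is the decoupling the
text refers to.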

The primary use case for Flume is as a logging system that gathers a set of log 
files on every machine in a cluster and aggregates them to a centralized persistent 
store such as the Hadoop Distributed File System (HDFS). Flume can also be used
as an HTTP event manager that deals with different types of requests and routes
each of them to a specific data store during a data acquisition process, such as a
NoSQL database like HBase.

Therefore, Apache Flume is not a pure data acquisition system but acts in a
complementary fashion by managing the different data types acquired and
transferring them to specific data stores or repositories.


4.4.2.5 Hadoop 


Apache Hadoop is an open-source project developing a framework for reliable, 
scalable, and distributed computing on big data using clusters of commodity 
hardware. It was derived from Google’s MapReduce and the Google File System
(GFS) and is written in Java. It is supported by a large community and is used in
both production and research environments by many organizations, most
notably Facebook, a9.com, AOL, Baidu, IBM, Imageshack, and Yahoo. The
Hadoop project consists of four modules:


• Hadoop Common: for common utilities used throughout Hadoop.
• Hadoop Distributed File System (HDFS): a highly available and efficient file
system.
• Hadoop YARN (Yet Another Resource Negotiator): a framework for job
scheduling and cluster management.
• Hadoop MapReduce: a system for parallel processing of large amounts of data.


A Hadoop cluster is designed according to the master-slave principle. The
master is the name node. It keeps track of the metadata about the file distribution.
Large files are typically split into chunks of 128 MB. Each chunk is stored in three
replicas, which are distributed across the cluster of data nodes (slave
nodes). In the case of a node failure its information is not lost; the name node is
able to allocate the data again. To monitor the cluster, every slave node regularly
sends a heartbeat to the name node. If no heartbeat is received from a slave within
a specific period, it is considered dead. As the master node is a single point of
failure, it is typically run on highly reliable hardware. As a precaution, a
secondary name node can keep track of changes in the metadata; with its help it is
possible to rebuild the functionality of the name node and thereby ensure the
functionality of the cluster.
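
The effect of chunking and replication can be made concrete with a small
calculation. The sketch below assumes the 128 MB chunk size and three replicas
mentioned above; the function `plan_blocks`, the node names, and the
round-robin placement are hypothetical simplifications (actual HDFS placement
also takes rack topology into account).

```python
import math


def plan_blocks(file_size_mb, data_nodes, block_size_mb=128, replication=3):
    """Split a file into blocks and assign each block to distinct data nodes."""
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        # Simplified round-robin placement over the available data nodes.
        placement[block] = [
            data_nodes[(block + replica) % len(data_nodes)]
            for replica in range(replication)
        ]
    return placement


placement = plan_blocks(1000, ["dn1", "dn2", "dn3", "dn4"])

# Simulate the failure of one data node: every block keeps other replicas.
surviving = {
    block: [node for node in nodes if node != "dn2"]
    for block, nodes in placement.items()
}
print(len(placement), min(len(nodes) for nodes in surviving.values()))
```

A 1000 MB file yields eight blocks (the last one only partly filled), and
after the loss of a single node every block still has at least two replicas
from which the name node can re-replicate.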

YARN is Hadoop’s cluster scheduler. It allocates a number of containers (which
are essentially processes) in a cluster of machines and executes arbitrary commands
on them. YARN consists of three main pieces: a ResourceManager, a
NodeManager, and an ApplicationMaster. In a cluster each machine runs a
NodeManager, responsible for running processes on the local machine. The
ResourceManager tells the NodeManagers what to run, and applications tell the
ResourceManager when they want to run something on the cluster.

Data is processed according to the MapReduce paradigm. MapReduce is a
framework for parallel, distributed computation. As with data storage, processing
works in a master-slave fashion: computation tasks are called jobs and are
distributed by the job tracker. Instead of moving the data to the calculation,
Hadoop moves the calculation to the data. The job tracker functions as a master
distributing and administering jobs in the cluster. Task trackers carry out the
actual work on jobs. Typically each cluster node runs a task tracker instance and a
data node. The MapReduce framework eases programming of highly distributed
parallel programs. A programmer can focus on writing the simpler map() and
reduce() functions dealing with the task at hand while the MapReduce infrastructure
takes care of running and managing the tasks in the cluster.
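
The division of labour between the programmer's map() and reduce() functions
and the surrounding infrastructure can be shown with a single-process sketch.
It does not use the Hadoop API; `run_job` is a hypothetical stand-in for the
framework, which in a real cluster would distribute the same three phases over
many task trackers.

```python
from collections import defaultdict


def map_fn(_offset, line):
    """User code: emit (word, 1) for every word in the input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """User code: sum all counts emitted for one word."""
    yield word, sum(counts)


def run_job(records, map_fn, reduce_fn):
    """Stand-in for the framework: map, group by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reduce_fn(key, groups[key]):
            results[out_key] = out_value
    return results


lines = [(0, "big data"), (1, "Big clusters")]
result = run_job(lines, map_fn, reduce_fn)
print(result)
```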

In the orbit of the Hadoop project a number of related projects have emerged. 
The Apache Pig project for instance is built upon Hadoop and simplifies writing and 
maintaining Hadoop implementations. Hadoop is very efficient for batch 
processing. The Apache HBase project aims to provide real-time access to big data.


4.5 Future Requirements and Emerging Trends for Big Data Acquisition


Big data acquisition tooling has to deal with high-velocity, high-variety, and
real-time data. Thus, tooling for data acquisition has to ensure a very high
throughput. Data can come from multiple sources (social networks, sensors, web
mining, logs, etc.) with different structures, or be unstructured (text, video,
pictures, and media files), and arrive at a very high pace (tens or hundreds of
thousands of events per second). Therefore, the main challenge in acquiring big data
is to provide frameworks and tools that ensure the required throughput for the
problem at hand without losing any data in the process.

In this context, emerging challenges for the acquisition of big data include the 
following: 


• Data acquisition is often started by tools that provide some kind of input data to
the system, such as social networks and web mining algorithms, sensor data
acquisition software, or periodically injected logs. Typically, the data acquisition
process starts with single or multiple end points where the data comes from.
These end points can take different technical forms, such as log importers or
Storm-based algorithms, or the data acquisition system may itself offer APIs to the
external world for injecting data, using RESTful services or any other programmatic
APIs. Hence, any technical solution that aims to acquire data from different sources
should be able to deal with this wide range of different implementations.

• To provide the mechanisms to connect data acquisition with data pre- and
post-processing (analysis) and storage, both in the historical and real-time
layers. In order to do so, the data acquisition tools should be able to connect to
the batch and real-time processing tools (e.g. Hadoop and Storm). This is
implemented in different ways. For instance, Apache Kafka uses a publish-subscribe
mechanism to which both Hadoop and Storm can subscribe, so that the messages
received are available to them. Apache Flume, on the other hand, follows a different
approach, storing the data in a NoSQL key-value store to ensure velocity, and
pushing the data to one or several receivers (e.g. Hadoop and Storm). There is a
thin line between data acquisition, storage, and analysis in this process, as data
acquisition typically ends by storing the raw data in an appropriate master dataset
and connecting with the analytical pipeline (especially for real-time, but also
batch processing).

• To come up with a structured or semi-structured model valid for data analysis, to
effectively pre-process acquired data, especially unstructured data. The borders
between data acquisition and analysis are blurred in the pre-processing stage.
Some may argue that pre-processing is part of processing, and therefore of data
analysis, while others believe that data acquisition does not end with the actual
gathering but also includes cleaning the data and providing a minimal set of
coherence and metadata on top of it. Data cleaning usually takes several steps,
such as boilerplate removal (e.g. removing HTML headers in web mining
acquisition), language detection and named entity recognition (for textual
resources), and providing extra metadata such as timestamp, provenance
information (yet another overlap with data curation), etc.

• The acquisition of media (pictures, video) is a significant challenge, but it is an
even bigger challenge to perform the analysis and storage of video and images.

• Data variety requires processing the semantics in the data in order to correctly
and effectively merge data from different sources while processing. Works on
semantic event processing such as semantic approximations (Hasan and Curry
2014a), thematic event processing (Hasan and Curry 2014b), and thingsonomy
tagging (Hasan and Curry 2015) are emerging approaches in this area.

• In order to perform post- and pre-processing of acquired data, the current state
of the art provides a set of open-source and commercial tools and frameworks. The
main goal when defining a correct data acquisition strategy is therefore to
understand the needs of the system in terms of data volume, variety, and
velocity, and to take the right decision on which tool is best suited to ensure the
acquisition and desired throughput.
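
Some of the cleaning steps named in the list above (boilerplate removal and the
addition of timestamp and provenance metadata) can be sketched with the Python
standard library. The function `clean_record`, the record layout, and the
example URL are assumptions for illustration; language detection and named
entity recognition are left out because they require dedicated libraries.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips head, script, and style content."""

    SKIPPED = {"head", "script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def clean_record(raw_html, source_url, acquired_at=None):
    """Strip boilerplate and attach minimal metadata to the acquired page."""
    extractor = TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    acquired_at = acquired_at or datetime.now(timezone.utc)
    return {
        "text": " ".join(extractor.chunks),
        "timestamp": acquired_at.isoformat(),
        "provenance": source_url,
    }


page = "<html><head><title>Shop</title></head><body><p>Great phone!</p></body></html>"
record = clean_record(page, "http://example.org/review/1")
print(record["text"], record["provenance"])
```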


4.6 Sector Case Studies for Big Data Acquisition 


This section analyses the use of big data acquisition technology within a number of 
sectors. 


4.6.1 Health Sector 


Within the health sector big data technology aims to establish a holistic approach 
whereby clinical, financial, and administrative data as well as patient behavioural 
data, population data, medical device data, and any other related health data are 
combined and used for retrospective, real-time, and predictive analysis. 

In order to establish a basis for the successful implementation of big data health 
applications, the challenge of data digitalization and acquisition (i.e. putting health 
data in a form suitable as input for analytic solutions) needs to be addressed. 

As of today, large amounts of health data are stored in data silos and data
exchange is only possible via scan, fax, or email. Due to inflexible interfaces
and missing standards, the aggregation of health data relies on individualized 
solutions with high costs. 

In hospitals, patient data is stored in CIS (clinical information system) or EHR
(electronic health record) systems. However, different clinical departments might
use different systems, such as RIS (radiology information system), LIS (laboratory
information system), or PACS (picture archiving and communication system) to
store their data. There is no standard data model or EHR system. Existing
mechanisms for data integration are either adaptations of standard data warehouse
solutions from horizontal IT providers, such as the Oracle Healthcare Data Model,
Teradata’s Healthcare Logical Data Model, and the IBM Healthcare Provider Data
Model, or new solutions like the i2b2 platform. While the first three are mainly
used to generate benchmarks regarding the performance of the overall hospital
organization, the i2b2 platform establishes a data warehouse that allows the
integration of data from different clinical departments in order to support the
task of identifying patient cohorts. In doing so, structured data such as diagnoses
and lab values are mapped to standardized coding systems. However, unstructured
data is not further labelled with semantic information. Besides its main
functionality of patient cohort identification, the i2b2 hive offers several
additional modules. In addition to specific modules for data import, export, and
visualization tasks, modules to create and use additional semantics are available.
For example, the natural language processing (NLP) tool offers a means to extract
concepts out of specific terms and connect them with structured knowledge.

Today, data can be exchanged by using exchange formats such as HL7. However,
due to non-technical reasons such as privacy, health data is commonly not
shared across organizations (the phenomenon of organizational silos). Information
about diagnoses, procedures, lab values, demographics, medication, provider, etc.,
is in general provided in a structured format, but not automatically collected in a
standardized manner. For example, lab departments use their own coding systems
for lab values without an explicit mapping to the LOINC (Logical Observation
Identifiers Names and Codes) standard. Also, different clinical departments often
use different, customized report templates without specifying the common
semantics. Both scenarios lead to difficulties in data acquisition and consequent
integration.
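
An explicit mapping from departmental lab codes to LOINC, which the text notes
is usually missing, could look like the following sketch. The local codes
("GLU", "HB", "XYZ"), the record layout, and the function name are hypothetical;
the two LOINC codes shown are given as examples and should be checked against
the current LOINC release before any real use.

```python
# Hypothetical local codes mapped to LOINC codes.
LOCAL_TO_LOINC = {
    "GLU": "2345-7",   # Glucose [Mass/volume] in Serum or Plasma
    "HB": "718-7",     # Hemoglobin [Mass/volume] in Blood
}


def map_to_loinc(results, mapping):
    """Attach a LOINC code where a mapping exists; report the rest separately."""
    mapped, unmapped = [], []
    for result in results:
        loinc = mapping.get(result["local_code"])
        if loinc is None:
            unmapped.append(result)
        else:
            mapped.append({**result, "loinc": loinc})
    return mapped, unmapped


lab_results = [
    {"local_code": "GLU", "value": 5.4},
    {"local_code": "HB", "value": 13.9},
    {"local_code": "XYZ", "value": 1.0},   # no mapping defined
]
mapped, unmapped = map_to_loinc(lab_results, LOCAL_TO_LOINC)
print(len(mapped), len(unmapped))
```

Keeping the unmapped results separate makes the gaps in the mapping visible
instead of silently dropping data during integration.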

Regarding unstructured data like texts and images, standards for describing
high-level meta-information are only partially available. In the imaging domain,
the DICOM (Digital Imaging and Communications in Medicine) standard for
specifying image metadata is available. However, for describing meta-information
of clinical reports or clinical studies a common (agreed) standard is missing. To the
best of our knowledge, no standard is available for representing the content
information of unstructured data like images, texts, or genomics data. Initial
efforts to change this situation are initiatives such as the structured reporting
initiative by RSNA or semantic annotations using standardized vocabularies. For
example, the Medical Subject Headings (MeSH) is a controlled vocabulary thesaurus
of the US National Library of Medicine to capture topics of texts in the medical
and biological domain. Several translations to other languages also exist.

Since each EHR vendor provides their own data model, there is no standard data
model for the usage of coding systems to represent the content of clinical reports. In
terms of the underlying means for data representation, existing EHR systems rely
on a case-centric rather than on a patient-centric representation of health data. This
hinders longitudinal health data acquisition and integration.

Easy-to-use structured reporting tools are required which do not create extra
work for clinicians, i.e. these systems need to be seamlessly integrated into the
clinical workflow. In addition, available context information should be used to
assist the clinicians. If structured reporting tools are easy to use, they can gain
acceptance by clinicians, such that most of the clinical documentation is carried
out in a semi-structured form and the quality and quantity of semantic annotations
increase.

From an organizational point of view, the storage, processing, access, and
protection of big data have to be regulated on several different levels:
institutional, regional, national, and international. There is a need to define who
authorizes which processes, who changes processes, and who implements process
changes. Therefore, a proper and consistent legal framework or guidelines
(e.g. ISO/IEC 27000) for all four levels are required.

IHE (Integrating the Healthcare Enterprise) enables plug-and-play and secure
access to health information whenever and wherever it is needed. It provides
different specifications, tools, and services. IHE also promotes the use of
well-established and internationally accepted standards (e.g. Digital Imaging and
Communications in Medicine, Health Level 7). Pharmaceutical and R&D data that
encompass clinical trials, clinical studies, population and disease data, etc. are
typically owned by the pharmaceutical companies, research labs/academia, or the
government. As of today, collecting all the datasets needed for conducting clinical
studies and related analysis requires a high degree of manual effort.


4.6.2 Manufacturing, Retail, and Transport 


Big data acquisition in the context of the retail, transportation, and manufacturing
sectors is becoming increasingly important. As data processing costs decrease and
storage capacities increase, data can now be continuously gathered. Manufacturing 
companies as well as retailers may monitor channels like Facebook, Twitter, or 
news for any mentions and analyse these data (e.g. customer sentiment analysis). 
Retailers on the web are also collecting large amounts of data by storing log files 
and combining that information with other data sources such as sales data in order 
to analyse and predict customer behaviour. In the field of manufacturing, all 
participating devices are nowadays interconnected (e.g. sensors, RFID), such that 
vital information is constantly gathered in order to predict defective parts at an early 
stage. 

All three sectors have in common that the data comes from very heterogeneous 
sources (e.g. log files, data from social media that needs to be extracted via 
proprietary APIs, data from sensors, etc.). Data comes in at a very high pace, 
requiring that the right technologies be chosen for extraction (e.g. MapReduce). 
Challenges may also include data integration. For example, product names used by 
customers on social media platforms need to be matched against IDs used for
product pages on the web and then matched against internal IDs used in Enterprise
Resource Planning (ERP) systems. Tools used for data acquisition in retail can be 
grouped by the two types of data typically collected in retail: 


• Sales data from accounting and controlling departments
• Data from the marketing departments


The Dynamite Data Channel Monitor, recently bought by Market Track LLC,
provides a solution to gather information about product prices on more than 1 billion
“buy” pages at more than 4000 global retailers in real time, and thus makes it
possible to study the impact of promotional investments, monitor prices, and track
consumer sentiment on brands and products.

The increasing use of social media not only empowers consumers to easily
compare services and products both with respect to price and quality, but also
enables retailers to collect, manage, and analyse high-volume, high-velocity data,
providing a great opportunity for the retail industry. To gain competitive
advantages, real-time information is essential for accurate prediction and
optimization models. From a data acquisition perspective, this requires means for
stream data computation that can deal with the challenges of the Vs of the data.

In order to benefit the transportation sector (especially multimodal
urban transportation), tools that support big data acquisition have to achieve two
main tasks (DHL 2013; Davenport 2013). First, they have to handle large amounts of
personalized data (e.g. location information) and deal with the associated privacy
issues. Second, they have to integrate data from different service providers,
including geographically distributed sensors (i.e. the Internet of Things (IoT)) and
open data sources.

Different players benefit from big data in the transport sector. Governments and 
public institutions use an increasing amount of data for traffic control, route 
planning, and transport management. The private sector exploits increasing
amounts of data for route planning and revenue management to gain competitive
advantages, save time, and increase fuel efficiency. Individuals increasingly use
data via websites, mobile device applications, and GPS information for route 
planning to increase efficiency and save travel time. 

In the manufacturing sector, tools for data acquisition mainly need to process
large amounts of sensor data. Those tools need to handle sensor data that may be
incompatible with other sensor data, and thus data integration challenges need to be
tackled, especially when sensor data is passed through multiple companies in a
value chain.

Another category of tools needs to address the issue of integrating data produced 
by sensors in a production environment with data from, e.g. ERP systems within 
enterprises. This is best achieved when tools produce and consume standardized 
metadata formats.


4.6.3 Government, Public, Non-profit


Integrating and analysing large amounts of data plays an increasingly important role
in today’s society. Often, however, new discoveries and insights can only be
attained by integrating information from dispersed sources. Despite recent 
advances in structured data publishing on the web (such as using RDF in attributes 
(RDFa) and the schema.org initiative), the question arises how larger datasets can 
be published in a manner that makes them easily discoverable and facilitates 
integration as well as analysis. 

One approach for addressing this problem is data portals, which enable
organizations to upload and describe datasets using comprehensive metadata schemes.
Similar to digital libraries, networks of such data portals can support the
description, archiving, and discovery of datasets on the web. Recently, the number
of data catalogues available on the web has grown rapidly. The data catalogue
registry datacatalogs.org lists 314 data catalogues worldwide. Examples of the
increasing popularity of data catalogues are Open Government Data portals, data
portals of international organizations and NGOs, as well as scientific data portals.
In the public and governmental sector a few catalogues and data hubs, such as
publicdata.eu, can be used to find metadata or at least to find locations (links) to
interesting media files.

The public sector is centred around the activities of the citizens. Data acquisition
in the public sector includes tax collection, crime statistics, water and air pollution
data, weather reports, energy consumption, Internet business regulation (online
gaming, online casinos, intellectual property protection), and others.

The open data initiatives of the governments (data.gov, data.gov.uk for open
public data, or govdata.de) are recent examples of the increasing importance of
public and non-profit data. Similar initiatives exist in many countries. Most
data collected by public institutions and governments of these countries is in
principle available for reuse. The W3C guidance on opening up government data
(Bennett and Harvey 2009) suggests that data should be published in its original raw
format as soon as it is available and then enhanced with semantics and metadata.
However, in many cases governments struggle to publish certain data, due to
the fact that the data needs to be strictly non-personal and non-sensitive and
compliant with data privacy and protection regulations. Many different sectors
and players can benefit from this public data.

The following presents several case studies for implementing big data technol- 
ogies in different areas of the public sector. 


4.6.3.1 Tax Collection Area 


One key area for big data solutions is the recovery of millions of dollars of tax
revenue per year. The challenge for such an application is to develop a fast, accurate
identity resolution and matching capability for a budget-constrained,
limited-staffed state tax department in order to determine where to deploy scarce
auditing resources and enhance tax collection efficiency. The main implementation
highlights are:


• Rapidly identify exact and close matches

• Enable de-duplication from data entry errors

• High throughput and scalability to handle growing data volumes

• Quickly and easily accommodate file format changes and the addition of new data
sources


One solution is based on software developed by the Pervasive Software company:
the Pervasive DataRush engine, the Pervasive DataMatcher, and the Pervasive
Data Integrator. Pervasive DataRush provides simple constructs to:


• Create units of work (processes) that can each individually be made parallel.

• Tie processes together in a dataflow graph (assemblies), but then enable the
reuse of complex assemblies as simple operators in other applications.

• Further tie operators into new, broader dataflow applications.

• Run a compiler that can traverse all sub-assemblies while executing customizers
to automatically define parallel execution strategies based on then-current
resources and/or more complex heuristics (this will only improve over time).


This is achieved using techniques such as fuzzy matching, record linking, and 
the ability to match any combination of fields in a dataset. Other key techniques 
include data integration and Extract, Transform, Load (ETL) processes that save 
and store all design metadata in an open XML-based design repository for easy 
metadata interchange and reuse. This enables fast implementation and deployment 
and reduces the cost of the entire integration process. 
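
The fuzzy matching and record linking mentioned above can be illustrated with a
minimal sketch. It does not use or reproduce the Pervasive products; it relies only
on Python's standard difflib module, and the record layout, field names, sample data,
and the 0.8 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not count."""
    return " ".join(value.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_records(left, right, fields, threshold=0.8):
    """Pair records whose average similarity over `fields` reaches `threshold`."""
    matches = []
    for l_rec in left:
        for r_rec in right:
            score = sum(similarity(l_rec[f], r_rec[f]) for f in fields) / len(fields)
            if score >= threshold:
                matches.append((l_rec["id"], r_rec["id"], round(score, 3)))
    return matches


# Hypothetical sample records from two sources describing the same taxpayers.
tax_filings = [
    {"id": "T1", "name": "Jonathan Smith", "address": "12 Oak Street"},
    {"id": "T2", "name": "Maria Garcia", "address": "7 Hill Road"},
]
licence_records = [
    {"id": "L1", "name": "Jonathon Smith", "address": "12 Oak St."},
    {"id": "L2", "name": "Peter Jones", "address": "99 Lake Avenue"},
]

pairs = match_records(tax_filings, licence_records, fields=("name", "address"))
print(pairs)
```

The nested loop compares every pair of records, which is only workable for small
inputs; a production system would typically first group candidate records (blocking)
and run the comparisons in parallel.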


4.6.3.2 Energy Consumption 


An article reports on the problems in the regulation of energy consumption. The 
main issue is that when energy is put on the distribution network it must be used at 
that time. Energy providers are experimenting with storage devices to assist with 
this problem, but they are nascent and expensive. Therefore the problem is tackled 
with smart metering devices. 

When collecting data from smart metering devices, the first challenge is to store
the large volume of data. For example, assuming that 1 million collection devices
retrieve 5 kB of data per single collection, the potential data volume growth in a
year can be up to 2920 TB.
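
The arithmetic behind such an estimate can be sketched as follows. The text does not
state the collection frequency, so it is a free parameter here. Assuming decimal
units (1 TB = 10^9 kB) and a 365-day year, the quoted 2920 TB would correspond to
1600 collections per device per day; that number is derived in this sketch and is
not taken from the source.

```python
def annual_volume_tb(devices: int, kb_per_collection: int, collections_per_day: int) -> float:
    """Yearly raw data volume in TB, using decimal units (1 TB = 10**9 kB)."""
    kb_per_year = devices * kb_per_collection * collections_per_day * 365
    return kb_per_year / 10**9


# Daily, hourly, and quarter-hourly collection, plus the assumed 1600 per day.
for per_day in (1, 24, 96, 1600):
    volume = annual_volume_tb(1_000_000, 5, per_day)
    print(f"{per_day} collections/day: {volume} TB/year")
```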

The consequential challenges are to analyse this huge volume of data, cross- 
reference that data with customer information, network distribution, and capacity 
information by segment, local weather information, and energy spot market 
cost data. 

Harnessing this data will allow the utilities to better understand the cost structure
and strategic options within their network, which could include:


• Adding generation capacity versus purchasing energy off the spot market
(e.g. renewables such as wind, solar, electric cars during off-peak hours)

• Investing in energy storage devices within the network to offset peak usage and
reduce spot purchases and costs

• Providing incentives to individual consumers, or groups of consumers, to change
energy consumption behaviours


One such approach from the Lavastorm company is a project that explores
analytics problems with innovative companies such as Falbygdens Energi AB
(FEAB) and Sweco. To answer key questions, the Lavastorm Analytic Platform
is utilized. The Lavastorm Analytics Engine is a self-service business analytics 
solution that empowers analysts to rapidly acquire, transform, analyse, and visual- 
ize data, and share key insights and trusted answers to business questions with 
non-technical managers and executives. The engine offers an integrated set of 
analytics capabilities that enables analysts to independently explore enterprise 
data from multiple data sources, create and share trusted analytic models, produce 
accurate forecasts, and uncover previously hidden insights in a single, highly visual 
and scalable environment. 


4.6.4 Media and Entertainment 


Media and entertainment is centred on knowledge included in media files. With
the significant growth of media files and associated metadata, due to the evolution
of the Internet and the social web, data acquisition in this sector has become a
substantial challenge.

According to a Quantum report, managing and sharing content can be a chal- 
lenge, especially for media and entertainment industries. With the need to access 
video footage, audio files, high-resolution images, and other content, a reliable and 
effective data sharing solution is required. 

Commonly used tools in the media and entertainment sector include: 


• Specialized file systems that are used as a high-performance alternative to NAS
and network shares

• Specialized archiving technologies that allow the creation of a digital archive
that reduces costs and protects content

• Specialized clients that enable both LAN-based applications and SAN-based
applications to share a single content pool

• Various specialized storage solutions (for high-performance file sharing,
cost-effective near-line storage, offline data retention, and high-speed primary
storage)


Digital on-demand services have radically changed the importance of schedules
for both consumers and broadcasters. The largest media corporations have already
invested heavily in the technical infrastructure to support the storage and streaming
of content. For example, the number of legal music download and streaming sites,
and Internet radio services, has increased rapidly in the last few years. Consumers
have an almost bewildering choice of options depending on the music genres,
subscription options, devices, and digital rights management (DRM) schemes they
prefer. Over 391 million tracks were sold in Europe in 2012, and 75 million tracks
were played on online radio stations.

According to Eurostat, there has been a massive increase in household access to 
broadband in the years since 2006. Across the “EU27” (EU member states and six 
other countries in the European geographical area) broadband penetration was at 
around 30 % in 2006 but stood at 72 % in 2012. For households with high-speed 
broadband, media streaming is a very attractive way of consuming content. Equally, 
faster upload speeds mean that people can create their own videos for social media 
platforms. 

There has been a huge shift away from mass, anonymized mainstream media, 
towards on-demand, personalized experiences. Large-scale shared consumer expe- 
riences such as major sporting events, reality shows, and soap operas are now 
popular. Consumers expect to be able to watch or listen to whatever they want, 
whenever they want. 

Streaming services put control in the hands of users who choose when to
consume their favourite shows, web content, or music.

Media companies hold significant amounts of personal data, whether on customers,
suppliers, content, or their own employees. Companies have responsibility
not just for themselves as data controllers, but also for their cloud service providers
(data processors). Many large and small media organizations have already suffered
catastrophic data breaches; two of the most high-profile casualties were Sony and
LinkedIn. They incurred not only the costs of fixing their data breaches, but also
fines from data protection bodies such as the Information Commissioner’s Office
(ICO) in the UK.


4.6.5 Finance and Insurance 


Integrating large amounts of data with business intelligence systems for analysis 
plays an important role in financial and insurance sectors. Some of the major areas 
for acquiring data in these sectors are exchange markets, investments, banking, 
customer profiles, and behaviour. 

According to McKinsey Global Institute Analysis, “Financial Services has the
most to gain from big data”. For ease of capturing and value potential, “financial
players get the highest marks for value creation opportunities”. Banks can add value
by improving a number of products, e.g., customizing UX, improved targeting,
adapting business models, reducing portfolio losses and capital costs, office
efficiencies, and new value propositions. Some of the publicly available financial data
are provided by international agencies such as Eurostat, the World Bank, the
European Central Bank, the International Monetary Fund, the International Finance
Corporation, and the Organization for Economic Co-operation and Development.
While these data sources are not as time sensitive as exchange markets, they
provide valuable complementary data.

Fraud detection is an important topic in finance. According to the Global Fraud
Study 2014, a typical organization loses about 5 % of revenues each year to fraud.
The banking and financial services sector experiences a large number of fraud cases.
Approximately 30 % of fraud schemes were detected by tip-offs and up to 10 % by
accident, but only up to 1 % by IT controls (ACFE 2014). Better and improved fraud
detection methods rely on real-time analysis of big data (Sensmeier 2013). To make
fraud detection more accurate and less intrusive, banks and financial service
institutions are increasingly using algorithms that rely on real-time data about
transactions. These technologies make use of large volumes of data being generated
at a high velocity and from hybrid sources. Often, data from mobile sources and
social data such as geographical information is used for prediction and detection
(Krishnamurthy 2013). By using machine-learning algorithms, modern systems are
able to detect fraud more reliably and faster (Sensmeier 2013). But there are
limitations for such systems. Because financial services operate in a regulatory
environment, the use of customer data is subject to privacy laws and regulations.
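
As an illustration of what analysing transactions in real time can mean in the
simplest case, the sketch below flags amounts that deviate strongly from the running
mean of a single, hypothetical account. It is not the method of any cited system:
the z-score rule, the threshold of 3, the warm-up length, and the sample amounts are
assumptions, and real fraud detection combines far more signals.

```python
import math


class RunningStats:
    """Welford's online algorithm for the mean and variance of a stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self) -> float:
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def flag_unusual(amounts, z_threshold=3.0, warm_up=5):
    """Yield (index, amount) for amounts far from the running mean seen so far."""
    stats = RunningStats()
    for i, amount in enumerate(amounts):
        std = stats.std()
        if stats.n >= warm_up and std > 0 and abs(amount - stats.mean) / std > z_threshold:
            yield i, amount
        else:
            stats.update(amount)  # only learn from transactions that look normal


# Hypothetical transaction amounts for one account, in arrival order.
amounts = [20.0, 25.0, 22.0, 19.0, 24.0, 21.0, 950.0, 23.0]
flagged = list(flag_unusual(amounts))
print(flagged)
```

Each transaction is processed once and in constant memory, which is the property
that makes this kind of check usable on a high-velocity stream.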


4.7 Conclusions 


Data acquisition is an important process and enables the subsequent tools of the
data value chain to do their work properly (e.g. data analysis tools). The state of the
art regarding data acquisition tools showed that there are plenty of tools and
protocols, including open-source solutions, that support the process of data
acquisition. Many of these tools have been developed by, and are operational within
the production environments of, major players such as Facebook or Amazon.

Nonetheless, many open challenges remain in successfully deploying effective big
data solutions for data acquisition in the different sectors (see section “Future
Requirements and Emerging Trends for Big Data Acquisition”). The main issue
remains producing highly scalable, robust solutions for today and researching
next-generation systems for the ever-increasing industrial requirements.

Chapter 5
Big Data Analysis 


John Domingue, Nelia Lasierra, Anna Fensel, Tim van Kasteren, 
Martin Strohbach, and Andreas Thalhammer 


5.1 Introduction 


Data comes in many forms and one dimension to consider and compare differing 
data formats is the amount of structure contained therein. The more structure a 
dataset has the more amenable it will be to machine processing. At the extreme, 
semantic representations will enable machine reasoning. Big data analysis is the 
sub-area of big data concerned with adding structure to data to support decision- 
making as well as supporting domain-specific usage scenarios. This chapter out- 
lines key insights, state of the art, emerging trends, future requirements, and 
sectorial case studies for data analysis. 

The position of big data analysis within the overall big data value chain can be 
seen in Fig. 5.1. ‘Raw’ data which may or may not be structured and which will 
usually be composed of many different formats is transformed to be ready for data 
curation, data storage, and data usage. That is why without big data analysis most of
the acquired data would be useless.

[Figure: Big Data Value Chain, showing the stages Data Acquisition, Data Analysis,
Data Curation, Data Storage, and Data Usage with their associated topics]

Fig. 5.1 Data analysis in the big data value chain


The analysis found that the following generic techniques are either useful today 
or will be in the short to medium term: reasoning (including stream reasoning), 
semantic processing, data mining, machine learning, information extraction, and 
data discovery. 

These generic areas are not new. What is new however are the challenges raised 
by the specific characteristics of big data related to the three Vs: 


• Volume: places scalability at the centre of all processing. Large-scale
reasoning, semantic processing, data mining, machine learning, and information
extraction are required.

• Velocity: this challenge has resulted in the emergence of the areas of stream
data processing, stream reasoning, and stream data mining to cope with high
volumes of incoming raw data.

• Variety: may take the form of differing syntactic formats (e.g. spreadsheet
vs. csv) or differing data schemas or differing meanings attached to the same
syntactic forms (e.g. ‘Paris’ as a city or person). Semantic techniques, especially
those related to Linked Data, have proven to be the most successful approach
applied thus far, although scalability issues remain to be addressed.


5.2 Key Insights for Big Data Analysis 


Interviews with various stakeholders related to big data analysis have identified the 
following key insights. A full list of interviewees is given in Table 3.1. 


Table 3.1 Big data analysis interviewees

No. | First name | Last name    | Organization                                  | Role/Position
1   | Sören      | Auer         | Leipzig                                       | Professor
2   | Ricardo    | Baeza-Yates  | Yahoo!                                        | VP of Research
3   | François   | Bancilhon    | Data Publica                                  | CEO
4   | Richard    | Benjamins    | Telefónica                                    | Director Biz Intel
5   | Hjalmar    | Gislason     | datamarket.com                                | Founder
6   | Alon       | Halevy       | Google                                        | Research Scientist
7   | Usman      | Haque        | Cosm (Pachube)                                | Director Urban Project Division
8   | Steve      | Harris       | Garlik/Experian                               | CTO
9   | Jim        | Hendler      | RPI                                           | Professor
10  | Alek       | Kołcz        | Twitter                                       | Data Scientist
11  | Prasanna   | Lal Das      | World Bank                                    | Snr Prog. Officer, Head of Open Financial Data Program
12  | Peter      | Mika         | Yahoo!                                        | Researcher
13  | Andreas    | Ribbrock     | Teradata GmbH                                 | Team Lead Big Data Analytics and Senior Architect
14  | Jeni       | Tennison     | Open Data Institute                           | Technical Director
15  | Bill       | Thompson     | BBC                                           | Head of Partner Development
16  | Andraž     | Tori         | Zemanta                                       | Owner and CTO
17  | Frank      | van Harmelen | Amsterdam                                     | Professor
18  | Marco      | Viceconti    | University of Sheffield and the VPH Institute | Professor and Director
19  | Jim        | Webber       | Neo                                           | Chief Scientist


Old Technologies Applied in a New Context Individual old technologies and
combinations of them are being applied in the big data context. The difference is the
scale (volume) and the amount of heterogeneity encountered (variety). Specifically, in
the web context a focus is seen on large semantically based datasets such as
Freebase and on the extraction of high-quality data from the web. Besides scale
there is novelty in the fact that these technologies come together at the same time.


Stream Data Mining This is required to handle high volumes of stream data that 
will come from sensor networks or online activities from high numbers of users. 
This capability would allow organizations to provide highly adaptive and accurate 
personalization. 


‘Good’ Data Discovery Recurrent questions asked by users and developers are:
Where can we get the data about X? Where can we get information about Y? The
data is hard to find, and the data that is found is often out of date and not in the
right format. Crawlers are needed to find big datasets, metadata for big data, and
meaningful links between related datasets, together with a dataset ranking
mechanism that performs as well as PageRank does for web documents.


Dealing with Both Very Broad and Very Specific Data A neat feature of
information extraction from the web is that the web is about everything, so coverage
is broad. Pre-web the focus was on specific domains when building databases and
knowledge bases. This can no longer be done in the context of the web. The whole
notion of “conceptualizing the domain” is altered: now the domain is everything in
the world. On the positive side, the benefit is you get a lot of breadth, and the
research challenge is how one can go deeper into a domain while maintaining the
broad context.


Simplicity Leads to Adoptability Hadoop¹ succeeded because it is the easiest
tool to use for developers, changing the game in the area of big data. It did not
succeed because it was the best but because it was the easiest to use (along with
HIVE²). Hadoop managed to successfully balance dealing with complexity
(processing big data) and simplicity for developers. Conversely, semantic
technologies are often hard to use. Hjalmar Gislason, one of our interviewees,
advocates the need for the “democratisation of semantic technologies”.


Ecosystems Built around Collections of Tools Have a Significant Impact These
are often driven by large companies where a technology is created to solve an
internal problem and then is given away. Apache Cassandra³ is an example of this:
it was initially developed by Facebook to power their inbox search feature until
2010. The ecosystem around Hadoop is perhaps the best known.


Communities and Big Data Will Be Involved in New and Interesting Relation- 
ships Communities will be engaged with big data in all stages of the value chain 
and in a variety of ways. In particular, communities will be involved intimately in 
data collection, improving data accuracy and data usage. Big data will also enhance 
community engagement in society in general. 


Cross-sectorial Uses of Big Data Will Open Up New Business Opportunities
The retail section of future requirements and emerging trends describes an example
of this. O2 UK together with Telefónica Digital has recently launched a service
that maps and repurposes mobile data for the retail industry. This service allows
retailers to plan where to site retail outlets based upon the daily movement of
potential customers. This service highlights the importance of internal big data
(in this case mobile records) that is later combined with external data sources
(geographical and preference data) to generate new types of business. In general,
aggregating data across organizations and across sectors will enhance the
competitiveness of European industry.

The biggest challenge for most industries is now to incorporate big data
technologies in their processes and infrastructures. Many companies identify the need
for doing big data analysis, but do not have the resources for setting up an
infrastructure for analysing and maintaining the analytics pipeline (Benjamins).
Increasing the simplicity of the technology will aid the adoption rate. On top of this
a large body of domain knowledge has to be built up within each industry on how
data can be used: what is valuable to extract and what output can be used in daily
operations.


¹ http://hadoop.apache.org/
² https://hive.apache.org/
³ http://cassandra.apache.org/

The costs of implementing big data analytics are a business barrier for big data 
technology adoption. Anonymity, privacy, and data protection are cross-sectorial 
requirements highlighted for big data technologies. Additional information can be 
found in the final analysis of sector’s requisites (Zillner et al. 2014). Examples of 
some sectorial case studies can be found in Sect. 5.5. 


5.3 Big Data Analysis State of the Art 


Industry today applies large-scale machine learning and other algorithms to the
analysis of huge datasets, in combination with complex event processing and
stream processing for real-time analytics. The interviewed experts also highlighted
current trends in Linked Data, semantic technologies, and large-scale reasoning in
relation to the main research challenges and main technological requirements for
big data.

This section presents a state-of-the-art review regarding big data analysis and 
published literature, outlining a variety of topics ranging from working efficiently 
with data to large-scale data management. 


5.3.1 Large-Scale: Reasoning, Benchmarking, and Machine 
Learning 


The size and heterogeneity of the web preclude performing full reasoning and
require new technological solutions to satisfy the requested inference capabilities.
The same scalability requirement extends to machine-learning technologies, which
are needed in order to extract useful information from huge amounts of data.
Specifically, François Bancilhon mentioned in his interview how important
machine learning is for topic detection and document classification at
Data Publica. Ricardo Baeza-Yates, in turn, highlighted in his interview the need for
standards in big data computation in order to allow big data providers to compare
their systems.


5.3.1.1 Large-Scale Reasoning 


The promise of reasoning as promoted within the context of the semantic web does
not currently match the requirements of big data due to scalability issues.
Reasoning is defined by certain principles, such as soundness and completeness,
which are far from the practical world and the characteristics of the web, where data
is often contradictory, incomplete, and of an overwhelming size. Moreover, there
exists a gap between reasoning at web scale and the more tailored reasoning over
simplified subsets of first-order logic, due to the fact that many aspects are assumed,
which differ from reality (e.g. small set of axioms and facts, completeness and
correctness of inference rules).

State-of-the-art approaches (Fensel 2007) propose a combination of reasoning 
and information retrieval methods (based on search techniques), to overcome the 
problems of web scale reasoning. Incomplete and approximate reasoning was 
highlighted by Frank van Harmelen as an important topic in his interview. 

Querying and reasoning over structured data can be supported by semantic 
models automatically built from word co-occurrence patterns from large text 
collections (distributional semantic models) (Turney and Pantel 2010). Distribu- 
tional semantic models provide a complementary layer of meaning for structured 
data, which can be used to support semantic approximation for querying and 
reasoning over heterogeneous data (Novacek et al. 2011; Freitas et al. 2013; Freitas 
and Curry 2014). 
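
The idea of building a semantic model from word co-occurrence patterns can be shown
with a toy sketch. The three-sentence corpus, the window size of two words, and the
use of raw counts with cosine similarity are assumptions for illustration; the cited
approaches use far larger corpora and more refined weighting.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical miniature corpus.
corpus = [
    "the mayor of paris opened the new museum",
    "the mayor of berlin opened the new library",
    "the singer released a new album",
]
vectors = cooccurrence_vectors(corpus)
city_similarity = cosine(vectors["paris"], vectors["berlin"])
other_similarity = cosine(vectors["paris"], vectors["album"])
print(city_similarity, other_similarity)
```

Words that occur in similar contexts end up with similar vectors, which is what
allows a query term to be matched approximately against differently named elements
of a structured dataset.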

The combination of logic-based reasoning with information retrieval and
machine-learning techniques is one of the key aspects of these approaches, which
provide a trade-off between full-fledged reasoning and what is practical
in the web context. When the topic of scalability arises, storage systems
play an important role as well, especially the indexing techniques and retrieval
strategies. The trade-off between online (backward) reasoning and offline (forward)
reasoning was mentioned by Frank van Harmelen in his interview. Peter Mika
also outlined the importance of efficient indexing techniques in his interview.
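
The trade-off between offline (forward) and online (backward) reasoning can be
sketched with a single transitive subclass rule. The class names and the restriction
to one rule are assumptions for illustration and do not reflect any particular
reasoner.

```python
def forward_closure(subclass_of):
    """Offline (forward) reasoning: materialize every implied (sub, super) pair."""
    closure = set(subclass_of)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def backward_query(subclass_of, sub, sup, _seen=None):
    """Online (backward) reasoning: derive one answer on demand at query time."""
    seen = _seen if _seen is not None else set()
    if (sub, sup) in subclass_of:
        return True
    seen.add(sub)
    return any(
        backward_query(subclass_of, mid, sup, seen)
        for (s, mid) in subclass_of
        if s == sub and mid not in seen
    )


# Hypothetical asserted facts: (subclass, superclass) pairs.
facts = {("Sedan", "Car"), ("Car", "Vehicle"), ("Vehicle", "Thing")}

materialized = forward_closure(facts)  # pay once, store more
print(("Sedan", "Thing") in materialized)  # answering is now a set lookup
print(backward_query(facts, "Sedan", "Thing"))  # store less, compute per query
```

Forward reasoning pays in storage and in recomputation whenever facts change;
backward reasoning keeps the store small but repeats the derivation for every query.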

Under the topic of large-scale systems, LarKC (Fensel et al. 2008) is a flagship
project. LarKC⁴ was an EU FP7 Large-Scale Integrating Project that aimed to
deal with large-scale reasoning systems and techniques using semantic
technologies.


5.3.1.2 Benchmarking for Large-Scale Repositories 


Benchmarking is nascent for the area of large-scale semantic data processing;
benchmarks are only now being produced. In particular, the Linked Data
Benchmark Council (LDBC) project⁵ aims to “create a suite of benchmarks for
large-scale graph and RDF (Resource Description Framework) data management as
well as establish an independent authority for developing benchmarks”. Part of
the suite of benchmarks created in LDBC covers the benchmarking and testing of
data integration and reasoning functionalities as supported by RDF systems. These
benchmarks are focused on testing: (1) instance matching and Extract, Transform,
Load (ETL) processes that play a critical role in data integration, and (2) the
reasoning capabilities of existing RDF engines. Both topics are very important in
practice, and both have been largely ignored by existing benchmarks for Linked Data
processing. In creating such benchmarks LDBC analyses various available scenarios
to identify those that can best showcase the data integration and reasoning
functionalities of RDF engines. Based on these scenarios, the limitations of existing
RDF systems are identified in order to gather a set of requirements for RDF data
integration and reasoning benchmarks. For instance, it is well known that existing
systems do not perform well in the presence of non-standard reasoning rules
(e.g. advanced reasoning that considers negation and aggregation). Moreover,
existing reasoners perform inference by materializing the closure of the dataset
(using backward or forward chaining). However, this approach might not be
applicable when application-specific reasoning rules are provided, and hence it is
likely that improving the state of the art will imply support for hybrid reasoning
strategies involving both backward and forward chaining, and query rewriting
(i.e. incorporating the ruleset in the query).


⁴ LarKC Homepage, http://www.larkc.eu, last visited 3/03/2015.
⁵ LDBC Homepage, http://www.ldbc.eu/, last visited 3/05/2015.


5.3.1.3 Large-Scale Machine Learning 


Machine-learning algorithms use data to automatically learn how to perform tasks 
such as prediction, classification, and anomaly detection. Most machine-learning 
algorithms have been designed to run efficiently on a single processor or core. 
Developments in multi-core architectures and grid computing have led to an 
increasing need for machine learning to take advantage of the availability of 
multiple processing units. Many programming interfaces and languages dedicated
to parallel programming exist, such as Orca, MPI, or OpenACC, which are useful for
general purpose parallel programming. However, it is not always obvious how
existing machine-learning algorithms can be implemented in a parallelized manner. 
There is a large body of research on distributed learning and data mining (Bhaduri 
et al. 2011), which encompasses machine-learning algorithms that have been 
designed specifically for distributed computing purposes. 

Rather than creating specific parallel versions of algorithms, more generalized 
approaches involve frameworks for programming machine learning on multiple 
processing units. One approach is to use a high-level abstraction that significantly 
simplifies the design and implementation of a restricted class of parallel algorithms. 
In particular the MapReduce abstraction has been successfully applied to a broad 
range of machine-learning applications. Chu et al. (2007) show that any algorithm 
fitting the statistical query model can be written in a certain summation form, which 
can be easily implemented in a MapReduce fashion and achieves a near linear 
speed-up with the number of processing units used. They show that this applies to a 
variety of learning algorithms (Chu et al. 2007). The implementations shown in the 
paper led to the first version of the MapReduce machine learning library Mahout. 
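
The summation form can be illustrated with ordinary least squares for a line
y = slope * x + intercept, whose solution depends only on a handful of sums over the
data. The sketch below is an assumption-laden toy and is not code from Chu et al. or
from Mahout: the partitions are plain Python lists and the map step runs
sequentially, whereas in a MapReduce system each partition would be processed on a
different node.

```python
from functools import reduce


def map_partition(rows):
    """Map step: partial sums over one data partition of (x, y) pairs."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Reduce step: the partial sums simply add up across partitions."""
    return tuple(p + q for p, q in zip(a, b))


def fit_line(partitions):
    """Solve least squares for y = slope * x + intercept from the global sums."""
    n, sx, sy, sxx, sxy = reduce(combine, map(map_partition, partitions))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1, split into two partitions.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]
slope, intercept = fit_line(partitions)
print(slope, intercept)
```

Because each partition contributes only five numbers, the communication cost is
independent of the partition size, which is what makes the near linear speed-up
plausible for algorithms of this form.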

Low et al. (2010) explain how the MapReduce paradigm restricts users to using
overly simple modelling assumptions to ensure there are no computational
dependencies in processing the data. They propose the GraphLab abstraction that
insulates users from the complexities of parallel programming (i.e. data races,
deadlocks), while maintaining the ability to express complex computational
dependencies using a data graph.

The programming languages, toolkits, and frameworks discussed allow many 
different configurations for carrying out large-scale machine learning. The ideal 
configuration to use is application dependent, since different applications will have 
different sets of requirements. However, one of the most popular frameworks used 
in recent years is that of Apache Hadoop, which is an open-source and free 
implementation of the MapReduce paradigm discussed above. Andraž Tori, one 
of our interviewees, identifies the simplicity of Hadoop and MapReduce as the main 
driver of its success. He explains that a Hadoop implementation can be 
outperformed in terms of computation time by, for example, an implementation 
using OpenMP, but Hadoop won in terms of popularity because it was easy to use. 

The parallelized computation efforts described above make it possible to process 
large amounts of data. Besides the obvious application of applying existing 
methods to increasingly large datasets, the increase in computation power also 
leads to novel large-scale machine-learning approaches. One example is the recent 
work from Le et al. (2011) in which a dataset of ten million images was used to 
teach a face detector using only unlabelled data. Using the resulting features in an 
object recognition task resulted in a performance increase of 70 % over the state of 
the art (Le et al. 2011). Utilizing large amounts of data to overcome the need for 
labelled training data could become an important trend. By using only unlabelled 
data, one of the biggest bottlenecks to the broad adoption of machine learning is 
bypassed. The use of unsupervised learning methods has its limitations though and 
it remains to be seen if similar techniques can also be applied in other application 
domains. 


5.3.2 Stream Data Processing 


Stream data mining was highlighted as a promising area of research by Ricardo 
Baeza-Yates in his interview. This technique relates to the technological capabil- 
ities needed to deal with data streams with high volume and high velocity, coming 
from sensor networks or other online activities in which a large number of users are
involved.


5.3.2.1 RDF Data Stream Pattern Matching 


Motivated by the huge amount of structured and unstructured data available on the 
web as continuous streams, streaming processing techniques using web technolo- 
gies have recently appeared. In order to process data streams on the web, it is 
important to cope with openness and heterogeneity. A core issue of data stream 
processing systems is to process data in a certain time frame and to be able to query 
for patterns. Additional desired features include support for static data, which does
not change over time and can be used to enrich dynamic data. Temporal operators and
time-based windows are also typically found in these systems, used to combine
several RDF graphs with time dependencies. Some major developments in this area
are C-SPARQL (Barbieri et al. 2010), ETALIS (Anicic et al. 2011), and
SPARKWAVE (Komazec et al. 2012).

C-SPARQL is a language based on SPARQL (SPARQL Protocol and RDF 
Query Language) and extended with definitions for streams and time windows. 
Incoming triples are first materialized based on RDFS and then fed into the 
evaluation system. C-SPARQL does not provide true continuous pattern evaluation, 
due to its use of RDF snapshots, which are evaluated periodically. However,
C-SPARQL’s strength is in situations with significant amounts of static knowledge,
which need to be combined with dynamic incoming data streams.
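
The idea of periodically evaluating a query over a time window and joining the result with static knowledge can be sketched as follows. This is plain Python, not C-SPARQL syntax or its API; the predicates `locatedIn` and `hasReading`, the sample data, and all function names are assumptions made for illustration.

```python
def window_snapshot(stream, now, width):
    """Return the triples whose timestamp lies in the window (now - width, now]."""
    return [triple for ts, triple in stream if now - width < ts <= now]


def evaluate(snapshot, static_triples, threshold):
    """Join dynamic readings with static knowledge: report the rooms (static)
    whose sensors (dynamic) reported a value above the threshold."""
    located_in = {s: o for s, p, o in static_triples if p == "locatedIn"}
    return sorted({located_in[s] for s, p, o in snapshot
                   if p == "hasReading" and o > threshold and s in located_in})


def periodic_results(stream, static_triples, width, step, end, threshold):
    """Evaluate the query once per step, as a snapshot-based engine would."""
    return {now: evaluate(window_snapshot(stream, now, width),
                          static_triples, threshold)
            for now in range(step, end + 1, step)}


# Hypothetical static knowledge and timestamped stream of triples.
STATIC = [("sensor1", "locatedIn", "roomA"), ("sensor2", "locatedIn", "roomB")]
STREAM = [(1, ("sensor1", "hasReading", 30)),
          (4, ("sensor2", "hasReading", 45)),
          (9, ("sensor1", "hasReading", 50))]
```

Because results are only produced at each evaluation step, a match is reported at the end of the window it falls into rather than at the moment the triple arrives, which is the limitation described above.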

ETALIS is an event-processing system on top of SPARQL. As the pattern 
language component of SPARQL was extended with event-processing syntax, the 
pattern language is called EP-SPARQL. The supported features are temporal 
operators, out-of-order evaluation, aggregate functions, several garbage collection 
modes, and different consumption strategies. 

SPARKWAVE provides continuous pattern matching over schema-enhanced 
RDF data streams. In contrast to C-SPARQL and EP-SPARQL, SPARKWAVE
is fixed regarding the utilized schema and does not support temporal operators or
aggregate functions. The benefit of having a fixed schema and no complex reason- 
ing is that the system can optimize and pre-calculate at the initialization phase the 
used pattern structure in memory, thus leading to high throughput when processing 
incoming RDF data. 


5.3.2.2 Complex Event Processing 


One insight of the interviews is that big data stream technologies can be classified 
according to (1) complex event-processing engines, and (2) highly scalable stream 
processing infrastructures. Complex event-processing engines focus on language 
and execution aspects of the business logic, while stream processing infrastructure 
provides the communication framework for processing asynchronous messages on 
a large scale. 

Complex event processing (CEP) describes a set of technologies that are able to 
process events “in stream”, i.e. in contrast to batch processing where data is inserted 
into a database and polled at regular intervals for further analysis. The advantage
of CEP systems is their capability to process potentially large amounts of events in
real time. The name complex event processing is due to the fact that simple events, 
e.g. from sensors or other operational data, can be correlated and processed 
generating more complex events. Such processing may happen in multiple steps, 
eventually generating an event of interest triggering a human operator or some 
business intelligence. 
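
The correlation of simple events into a complex event can be sketched with a small in-stream detector. The event format, the `Overheating` event type, and the thresholds are hypothetical and are not taken from any of the engines discussed in this section.

```python
from collections import defaultdict, deque


def detect_overheating(events, threshold=80.0, count=3, window=10):
    """Consume (timestamp, sensor, temperature) events one at a time and yield
    a complex event once `count` readings above `threshold` from the same
    sensor fall within `window` time units."""
    recent = defaultdict(deque)  # sensor -> timestamps of recent hot readings
    for ts, sensor, temp in events:
        if temp <= threshold:
            continue  # simple event of no interest
        hot = recent[sensor]
        hot.append(ts)
        # Forget hot readings that have dropped out of the time window.
        while hot and ts - hot[0] > window:
            hot.popleft()
        if len(hot) >= count:
            yield {"type": "Overheating", "sensor": sensor, "at": ts}
            hot.clear()
```

Events are handled as they arrive and only a small amount of state per sensor is kept, in contrast to storing all readings in a database and polling it later.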
As Voisard and Ziekow point out, an event-based system “encompasses a large
range of functionalities on various technological levels (e.g., language, execution, 
or communication)” (Voisard and Ziekow 2011). They provide a comprehensive 
survey that aids the understanding and classification of complex event-processing 
systems. 

For big data stream analytics, it is a key capability that complex event- 
processing systems are able to scale out in order to process all incoming events in 
a timely fashion as required by the application domain. For instance, the smart meter
data of a large utility company may generate millions or even billions of events per 
second that may be analysed in order to maintain the operational reliability of the 
electricity grid. Additionally, coping with the semantic heterogeneity behind mul- 
tiple data sources in a distributed event generation environment is a fundamental 
capability for big data scenarios. There are emerging automated semantic
event-matching approaches (Hasan and Curry 2014) that target scenarios with
heterogeneous event types. Examples of complex event-processing engines include the SAP
Sybase Event Stream Processor, IBM InfoSphere Streams,⁶ and ruleCore,⁷ to name
just a few.


5.3.3 Use of Linked Data and Semantic Approaches to Big 
Data Analysis 


According to Tim Berners-Lee and his colleagues (Bizer et al. 2009), “Linked Data 
is simply about using the Web to create typed links between data from different 
sources”. Linked data refers to machine-readable data, linked to other datasets and 
published on the web according to a set of best practices built upon web technol- 
ogies such as HTTP (Hypertext Transfer Protocol), RDF, and URIs (Uniform 
Resource Identifiers).⁸ Semantic technologies such as SPARQL, OWL, and RDF
allow one to manage and deal with these. Building on the principles of Linked Data, 
a dataspace groups all relevant data sources into a unified shared repository (Heath 
and Bizer 2011). Hence, a dataspace offers a good solution to cover the heteroge- 
neity of the web (large-scale integration) and deal with broad and specific types 
of data. 

Linked data and semantic approaches to big data analysis have been highlighted 
by a number of interviewees including Sören Auer, François Bancilhon, Richard
Benjamins, Hjalmar Gislason, Frank van Harmelen, Jim Hendler, Peter Mika, and
Jeni Tennison. These technologies were highlighted as they address important
challenges related to big data including efficient indexing, entity extraction and
classification, and search over data found on the web.


6 http://www-01.ibm.com/software/data/infosphere/streams, last visited 25/02/2014.
7 RuleCore Homepage, http://www.rulecore.com/, last visited 13/02/2014.
8 http://www.w3.org/standards/semanticweb/data
5.3.3.1 Entity Summarization


To the best of our knowledge, entity summarization was first mentioned in Cheng 
et al. (2008). The authors present Falcons which “... provides keyword-based 
search for Semantic Web entities”. In addition to features such as concept search,
ontology and class recommendation, and keyword-based search, the authors also
describe a popularity-based approach for ranking the statements an entity is involved
in. Further, they describe the use of the maximal marginal relevance (MMR) technique
(Carbonell and Jade 1998) to re-rank statements to account for diversity. In a later publication
(Cheng 2011), entity summarization requires “. . . ranking data elements according 
to how much they help identify the underlying entity”. This statement accounts for 
the most common definition of entity summarization: the ranking and selection of 
statements that identify or define an entity. 
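
Re-ranking with MMR can be sketched as a greedy selection that trades relevance against redundancy. This is a generic illustration of the technique, not the Falcons implementation; the weighting parameter `lam`, the function name `mmr_rank`, and the example scores are assumptions.

```python
def mmr_rank(candidates, relevance, similarity, k, lam=0.7):
    """Greedily pick k candidates, each time maximizing
    lam * relevance - (1 - lam) * (highest similarity to anything already picked)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(candidate):
            redundancy = max((similarity(candidate, s) for s in selected),
                             default=0.0)
            return lam * relevance[candidate] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical statements about a movie entity as (predicate, object) pairs,
# with made-up popularity scores used as relevance.
STATEMENTS = {
    ("starring", "Actor A"): 0.9,
    ("starring", "Actor B"): 0.85,
    ("director", "Director C"): 0.6,
    ("genre", "Genre D"): 0.3,
}


def same_predicate(a, b):
    """Toy similarity: statements with the same predicate count as redundant."""
    return 1.0 if a[0] == b[0] else 0.0
```

With `lam=0.5` and `k=3`, the second `starring` statement is skipped in favour of the `director` and `genre` statements, whereas a purely popularity-based ranking would return both `starring` statements first.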

In Singhal (2012), the author introduces Google’s Knowledge Graph. Next to 
entity disambiguation (“Find the right thing”) and exploratory search (“Go deeper 
and broader”), the knowledge graph also provides summaries of entities, i.e. “get 
the best summary”. Although not explained in detail, Google points out that they 
use the search queries of users for the summaries.⁹ For the knowledge graph
summaries, Google uses a unique dataset of millions of daily queries in order to 
provide concise summaries. Such a dataset is, however, not available to all content 
providers. 

As an alternative, Thalhammer et al. (2012b) suggest using the background data 
of consumption patterns of items in order to derive summaries of movie entities. 
The idea stems from the field of recommender systems where item neighbourhoods 
can be derived by the co-consumption behaviour of users (i.e. through analysing the 
user-item matrix). 
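
Deriving item neighbourhoods from co-consumption can be sketched with cosine similarity over a binary user-item matrix. This is a standard recommender-systems technique shown under simplifying assumptions; it is not the method of Thalhammer et al., and the function name and sample data are hypothetical.

```python
from math import sqrt


def item_neighbours(consumption, item, top_n=2):
    """consumption maps each user to the set of items that user consumed.
    Returns the top_n items most often co-consumed with `item`, scored by
    cosine similarity between the binary item columns."""
    users_of = {}
    for user, items in consumption.items():
        for i in items:
            users_of.setdefault(i, set()).add(user)
    target = users_of[item]
    scores = {}
    for other, users in users_of.items():
        if other == item:
            continue
        overlap = len(target & users)
        if overlap:
            scores[other] = overlap / sqrt(len(target) * len(users))
    # Highest similarity first; ties are broken alphabetically.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]
```

Items that are never consumed together with the target item receive no score and therefore never appear in its neighbourhood.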

A first attempt to standardize the evaluation of entity summarization is provided 
by Thalhammer et al. (2012a). The authors suggest a game with a purpose (GWAP) 
in order to produce a reference dataset for entity summarization. In the description, 
the game is designed as a quiz about movie entities from Freebase. In their 
evaluation, the authors compare the summaries produced by Singhal (2012) and 
the summaries of Thalhammer et al. (2012b). 


5.3.3.2 Data Abstraction Based on Ontologies and Communication 
Workflow Patterns 


The problem of communication on the web, as well as beyond it, is not trivial, 
considering the rapidly increasing amount of channels (content sharing platforms, 
social media and networks, variety of devices) and audiences to be reached. To 
address this problem, technological solutions are being developed such as the one 
presented by Fensel et al. (2012) based on semantics. Data management via 


9 http://insidesearch.blogspot.co.at/2012/05/introducing-knowledge-graph-things-not.html
semantic techniques can facilitate communication abstraction, increase
automation, and reduce the overall effort.

Inspired by the work of Mika (2005), eCommunication workflow patterns 
(e.g. typical query response patterns for online communication), which are usable 
and adaptable to the needs of the social web, can be defined (Stavrakantonakis 
2013a, b). Moreover, there is an interest in social network interactions (Fuentes- 
Fernandez et al. 2012). The authors of the last work coined “social property” as a 
network of activity theory concepts with a given meaning. Social properties are 
considered as “patterns that represent knowledge grounded in the social sciences 
about motivation, behaviour, organization, interaction” (Fuentes-Fernandez 
et al. 2012). The results of this research direction, combined with the generic
workflow patterns described in Van Der Aalst et al. (2003), are highly relevant
to the materialization of the communication patterns. The design of the patterns
is also related to the collaboration among the various agents as described in Dorn 
et al. (2012) in the scope of the social workflows. Aside from the social properties, 
the work described in Rowe et al. (2011) introduces the usage of ontologies in the 
modelling of the user’s activities in conjunction with content and sentiment. In the
context of the approach, modelling behaviours enables one to identify patterns in
communication problems and understand the dynamics in discussions in order to
discover ways of engaging more efficiently with the public in the social web.
Several researchers have proposed the realization of context-aware workflows
(Wieland et al. 2007) and social collaboration processes (Liptchinsky et al. 2012), 
which are related to the idea of modelling the related actors and artefacts in order to 
enable adaptiveness and personalization in the communication patterns 
infrastructure. 


5.4 Future Requirements and Emerging Trends for Big 
Data Analysis 


5.4.1 Future Requirements for Big Data Analysis


5.4.1.1 Next Generation Big Data Technologies


Current big data technologies such as Apache Hadoop have matured well over the 
years into platforms that are widely used within various industries. Several of our 
interviewees have identified future requirements that the next generation of big data 
technologies should address: 


• Handle the growth of the Internet (Baeza-Yates): as more users come online,
big data technologies will need to handle larger volumes of data.

• Process complex data types (Baeza-Yates): data such as graph data, and possibly
other types of more complicated data structures, need to be easily processed
by big data technologies.

• Real-time processing (Baeza-Yates): big data processing was initially carried
out in batches of historical data. In recent years, stream processing systems such
as Apache Storm have become available and enable new application capabilities.
This technology is relatively new and needs to be developed further.

• Concurrent data processing (Baeza-Yates): being able to process large quantities
of data concurrently is very useful for handling large volumes of users at
the same time.

• Dynamic orchestration of services in multi-server and cloud contexts (Tori):
most platforms today are not suitable for the cloud, and keeping data consistent
between different data stores is challenging.

• Efficient indexing (Mika): indexing is fundamental to the online lookup of data
and is therefore essential in managing large collections of documents and their
associated metadata.


5.4.1.2 Simplicity 


The simplicity of big data technologies refers to how easily developers are able to 
acquire the technology and use it in their specific environment. Simplicity is 
important as it leads to a higher adoptability of the technology (Baeza-Yates). 
Several of our interviewees have identified the critical role of simplicity in current 
and future big data technologies. 

The success of Hadoop and MapReduce is mainly due to their simplicity (Tori).
Other big data platforms are available that can be considered more powerful, but
have a smaller community of users because their adoption is harder to manage.
Similarly, Linked Data technologies, for example RDF and SPARQL, have been
reported as overly complex and as having too steep a learning curve (Gislason).
Such technologies seem to be over-designed and overly complicated, suitable only
for use by specialists.

Overall, there exist some very mature technologies for big data analytics, but 
these technologies need to be industrialized and made accessible to everyone 
(Benjamins). People outside of the core big data community should become 
aware of the possibilities of big data, to obtain wider support (Das). Big data is 
moving beyond the Internet industry and into other non-technical industries. An 
easy-to-use big data platform will help in the adoption of big data technologies by 
non-technical industries. 


5.4.1.3 Data 


An obvious key ingredient to big data solutions is the data itself. Our interviewees 
identified several issues that need to be addressed. 

Large companies such as Google and Facebook are working on big data and they 
will focus their energies on certain areas and not on others. EU involvement could 
support a big data ecosystem that encourages a variety of small, medium, and large
players, where regulation is effective and data is open (Thompson). 

In doing so, it is important to recognize that there is far more data out there than
most people realize and that this data could help us to make better decisions to identify
threats and see opportunities. A lot of the data needed already exists, but it is not
easy to find and use. Solving this issue will help businesses, policy makers,
and end users in decision-making. Just making more of the world’s data available at
people’s fingertips will have a substantial effect overall. The impact will be
especially significant in emergency situations such as earthquakes and other natural
disasters (Halevy) (Gislason).

However, making data available in pre-Internet companies and organizations is 
difficult. In Internet companies, there was a focus on using collected data for 
analytic purposes from the very beginning. Pre-Internet companies face privacy and
legal issues, as well as technical and process restrictions, in repurposing
the data. This holds even for data that is already available in digital form, such as
call detail records for telephone companies. The processes around storing and using 
such data were never set up with the intention of using the data for analytics 
(Benjamins). 

Open data initiatives can play an important role in helping companies and 
organizations get the most out of data. Once a dataset has gone through the 
necessary validations with regard to privacy and other restrictions, it can be reused 
for multiple purposes by different companies and organizations and can serve as a 
platform for new business (Hendler). It is therefore important to invest in processes 
and legislation that support open data initiatives. Achieving an acceptable policy 
seems challenging. As one of our interviewees notes, there is an inherent tension
between open data and privacy: it may not be possible to truly have both (Tori).
Closed datasets should also be addressed. A lot of valuable information, such as
cell phone data, is currently closed and owned by the telecom industry. The EU
should look into ways to make such data available to the big data community, while
taking into account the associated cost of making the data open. It should also
consider how the telecom industry can benefit from making data open while taking
into account any privacy concerns (Das). The web can also serve as an important data source.
Companies such as Data Publica rely on snapshots of the web (which are 60-70 
terabytes) to support online services. Freely available versions of web snapshots are 
available, but more up-to-date versions are preferred. These do not necessarily have
to be free, but they should be cheap. The big web players such as Google and Facebook
have access to data related to searches and social networks that has important societal
benefit. For example, dynamic social processes such as the spread of disease or rates
of employment are often most accurately tracked by Google searches. The EU may
want to prioritize European equivalents of these services, analogous to the way the
Chinese have cloned Google and Twitter (Bancilhon).

As open datasets become more common, it becomes increasingly challenging to 
discover the dataset needed. One prediction estimates that by 2015 there will be 
over 10 million datasets available on the web (Hendler). Valuable lessons can be 
learnt from how document discovery evolved on the web. Early on there was a 
registry: all of the web could be listed on a single web page; then users and
organizations had their own lists; then lists of lists. Later Google came to dominate
by providing metrics on how documents link to other documents. If an analogy is
drawn to the data area, it is currently in the registry era. It needs crawlers to find big
datasets, good dataset metadata on contents, links between related datasets, and a
relevant dataset ranking mechanism (analogous to PageRank). A discovery mechanism
that can only work with good quality data will drive data owners to publish
their data in a better way, analogous to the way that search engine optimization
(SEO) drives the quality of the current web (Tennison).
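
A link-based dataset ranking in the spirit of PageRank can be sketched with a simple power iteration. The dataset names and the link graph are hypothetical, and the damping factor of 0.85 is the conventional default rather than a value given by the interviewees.

```python
def rank_datasets(links, damping=0.85, iterations=50):
    """links maps each dataset to the list of datasets it links to.
    Returns a PageRank-style score per dataset; the scores sum to 1."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling dataset: spread its rank evenly over all datasets.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank
```

A dataset that many other datasets link to, such as a shared gazetteer of place names, ends up with the highest score, which mirrors how well-linked documents rise in web search.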


5.4.1.4 Languages 


Most of the big data technologies originated in the United States and therefore have 
primarily been created with the English language in mind. The majority of the 
Internet companies serve an international audience and many of their services are 
eventually translated into other languages. Most services are initially launched in 
English though and are only translated once they gain popularity. Furthermore, 
certain language-related technology optimizations (e.g. search engine optimiza- 
tions) might work well for English, but not for other languages. In any case, 
languages need to be taken into account at the very beginning, especially in Europe, 
and should play an important role in creating big data architectures (Halevy).


5.4.2 Emerging Paradigms for Big Data Analysis


5.4.2.1 Communities


The rise of the Internet makes it possible to quickly reach a large audience and grow 
communities around topics of interest. Big data is starting to play an increasingly 
important role in that development. Our interviewees have mentioned this emerging 
paradigm on a number of occasions. 


• Rise of data journalists: Data journalists are able to write interesting articles
based on data uploaded by the public to infrastructure such as Google Fusion Tables.
The Guardian journalist Simon Rogers won the Best UK Internet Journalist award for
his work¹⁰ based on this platform. A feature of journalistic take-up is that data
blogs typically have a high dissemination impact (Halevy).

• Community engagement in local political issues: Two months after the school
massacre in Connecticut,¹¹ local citizens started looking at data related to gun


10 http://www.oii.ox.ac.uk/news/?id=576 
permit applications in two locations and exposed this on a map.¹² This led to a
huge discussion on the related issues (Halevy).

• Engagement through community data collection and analysis: The company
COSM (formerly Pachube) has been driving a number of community-led efforts.
The main idea behind these is that the way data is collected introduces specific
slants on how the data can be interpreted and used. Getting communities
involved has various benefits: the number of data collection points can be
dramatically increased; communities will often create bespoke tools for the
particular situation and to handle any problems in data collection; and citizen
engagement is increased significantly.

In one example, the company crowdsourced real-time radiation monitoring in
Japan following the problem with reactors in Fukushima. There are now hundreds
of radiation-related feeds from Japan on Pachube, monitoring conditions
in real time and underpinning more than half a dozen incredibly valuable
applications built by people around the world. These combine “official” data,
“unofficial” data, and also real-time networked Geiger counter measurements
contributed by concerned citizens (Haque).

• Community engagement to educate and improve scientific involvement:
Communities can be very useful in collecting data. Participation in such projects
allows the public to obtain a better understanding of certain scientific activities
and therefore helps to educate people in these topics. That increase in understanding
will further stimulate the development and appreciation of upcoming
technologies and therefore result in a positive self-reinforcing cycle
(Thompson).

• Crowdsourcing to improve data accuracy: Through crowdsourcing, the precision
of released UK Government data on the location of bus stops was dramatically
increased (Hendler).


These efforts tie in well with the future requirements on data discussed above. A
community-driven approach to creating datasets will stimulate data quality and 
lead to even more datasets becoming publicly available. 


5.4.2.2 Academic Impact 


The availability of large datasets will impact academia (Tori) for two reasons. First, 
public datasets can be used by researchers from disciplines such as social science 
and economics to support their research activities. Second, a platform for sharing 
academic datasets will stimulate reuse and improve the quality of studied datasets.
Sharing datasets also allows others to add annotations to the data, which
is generally an expensive task.


12 http://tinyurl.com/kvlv641 
Beyond big data technologies affecting other scientific disciplines, other
scientific disciplines are being brought into computer science. Big Internet companies
like Yahoo are hiring social scientists, including psychologists and economists,
to increase the effectiveness of analysis tools (Mika). More generally speaking, as
the analysis of data in various domains continues, an increasing need for domain
experts arises.


5.5 Sectors Case Studies for Big Data Analysis 


This section describes several big data case studies outlining the stakeholders 
involved, where applicable, and the relationship between technology and the 
overall sector context. In particular, it covers the following sectors: the public 
sector, health sector, retail sector, logistics, and finally the financial sector. In 
many cases the descriptions are supported by the interviews that were conducted, 
and add further evidence of the enormous potential for big data. 


5.5.1 Public Sector 


Smart cities generate data from sensors, social media, citizen mobile reports, and 
municipality data such as tax data. Big data technologies are used to process the 
large datasets that cities generate to impact society and businesses (Baeza-Yates).
This section discusses how big data technologies utilize smart city data to provide 
applications in traffic and emergency response. 


5.5.1.1 Traffic 


Smart city sensors that can be used for applications in traffic include induction loop 
detection, traffic cameras, and license plate recognition (LPR) cameras. Induction
loops can be used for counting traffic volume at a particular point. Traffic cameras 
can be combined with video analytic solutions to automatically extract statistics 
such as the number of cars passing and average speed of traffic. License plate 
recognition is a camera-based technology that can track license plates throughout 
the city using multiple cameras. All these forms of sensing help in estimating traffic 
statistics, although they vary in degree of accuracy and reliability. 

Deploying such technology on a city-wide level results in large datasets that can 
be used for day-to-day operations, as well as applications such as anomaly detection 
and support in planning operations. In terms of big data analysis, the most inter- 
esting application is anomaly detection. The system can learn from historical data 
what is considered to be normal traffic behaviour for the time of the day and the day 
of the week and detect deviations from the norm to inform operators in a command 
and control centre of possible incidents that require attention (Thajchayapong and
Barria 2010). Such an approach becomes even more powerful when combining the 
data from multiple locations using data fusion to get more accurate estimates of the 
traffic statistics that allow the detection of more complex scenarios. 
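
Learning normal traffic behaviour per time of day and day of the week, and flagging deviations from it, can be sketched with a simple z-score test. This is a deliberately simplified illustration and not the method of Thajchayapong and Barria; the function names, the data layout, and the threshold of three standard deviations are assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev


def learn_baseline(history):
    """history: iterable of (weekday, hour, vehicle_count) observations.
    Returns the mean and standard deviation for every (weekday, hour) slot
    that has at least two observations."""
    buckets = defaultdict(list)
    for weekday, hour, count in history:
        buckets[(weekday, hour)].append(count)
    return {slot: (mean(values), stdev(values))
            for slot, values in buckets.items() if len(values) >= 2}


def is_anomalous(baseline, weekday, hour, count, max_z=3.0):
    """Flag a count that deviates from the learned norm of its slot by more
    than max_z standard deviations. Slots without a baseline are never flagged."""
    if (weekday, hour) not in baseline:
        return False
    mu, sigma = baseline[(weekday, hour)]
    if sigma == 0:
        return count != mu
    return abs(count - mu) / sigma > max_z
```

A count of 120 vehicles on a Monday at 08:00, when the historical counts for that slot lie around 400, would be reported to the operator, while a count of 415 would not.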


5.5.1.2 Emergency Response 


Cities equipped with sensors can benefit during emergencies by obtaining action- 
able information that can aid in decision-making. Of particular interest is the 
possibility to use social media analytics during emergency response. Social media 
networks provide a constant flow of information that can be used as a low-cost 
global sensing network for gathering near real-time information about an emer- 
gency. Although people post a lot of unrelated information on social media net- 
works, any information about the emergency can be very valuable to emergency 
response teams. Accurate data can help in obtaining the correct situational aware- 
ness picture of the emergency, consequently enabling a more efficient and faster 
response that can reduce casualties and overall damage (Van Kasteren et al. 2014). 

Social media analytics is used to process large volumes of social media posts, 
such as tweets, to identify clusters of posts centred around the same topic (high 
content overlap), same area (for posts that contain GPS tags), and around the same 
time. Clusters of posts are the result of high social network activity in an area. This 
can be an indication of a landmark (e.g. the Eiffel Tower), a planned event (e.g. a
sports match), or an unplanned event (e.g. an accident). Landmark sites have high
tweet volumes throughout the year and can therefore be easily filtered out. For the
remaining events, machine-learning classifiers are used to automatically recognize
which clusters are of interest for an emergency response operator (Walther and 
Kaisser 2013). 
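
The clustering step described above (same area, same time, overlapping content) can be sketched as follows. This is a toy illustration, not the approach of Walther and Kaisser: the grid size, time slot, Jaccard threshold, and minimum cluster size are assumed values, and the landmark filtering and the classifier are left out.

```python
from collections import defaultdict


def jaccard(a, b):
    """Content overlap between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster_posts(posts, cell=0.01, slot=600, min_overlap=0.3, min_size=3):
    """posts: list of dicts with 'text', 'lat', 'lon', and 'time' (seconds).
    Posts are first bucketed by grid cell and time slot; inside each bucket,
    posts whose tokens overlap with the first post of a group join that group."""
    buckets = defaultdict(list)
    for post in posts:
        key = (int(post["lat"] // cell), int(post["lon"] // cell),
               int(post["time"] // slot))
        buckets[key].append(post)
    clusters = []
    for bucket in buckets.values():
        groups = []  # each entry: (token set of the first post, member posts)
        for post in bucket:
            tokens = set(post["text"].lower().split())
            for seed, members in groups:
                if jaccard(seed, tokens) >= min_overlap:
                    members.append(post)
                    break
            else:
                groups.append((tokens, [post]))
        clusters.extend(members for _, members in groups
                        if len(members) >= min_size)
    return clusters
```

A fixed grid is used here only for brevity; posts that lie close together but on opposite sides of a cell boundary end up in different buckets, which a production system would need to handle.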

Using social media data for purposes that it was not originally intended for is just 
a single example of the significant impact that can occur when the right data is 
presented to the right people at the right time. Some of our interviewees explained 
that there is far more data out there than most people realize and this data could help 
us to make better decisions to identify threats and see opportunities. A lot of the 
data needed already exists, but it is not always easy to find and use this data 
(Gislason) (Halevy). 


5.5.2 Health 


The previous section discussed data that is repurposed in applications that differ
strongly from the original application that generated the data. Such cases also exist
in the healthcare sector. For example, dynamic social processes such as the spread
of disease can be accurately tracked by Google searches (Bancilhon), and call detail
records from Telefónica have been used to measure the impact of epidemic alerts on
human mobility (Frias-Martinez et al. 2012).

Big data analytics can be used to solve significant problems globally. The EU is 
therefore advised to produce solutions that solve global problems rather than focus 
solely on problems that affect the EU (Thompson). An example is the construction 
of clean water wells in Africa. The decision on where to locate wells is based on 
spreadsheets that may contain data that has not been updated for 2 years. Given that
new wells can stop working after 6 months, this causes unnecessary hardship
(Halevy). Technology might offer a solution, either by allowing citizen
reports or by inferring the use of wells from other data sources.

The impact in local healthcare is expected to be enormous. Various technolog- 
ical projects are aimed at realizing home healthcare, where at the very least people 
are able to record health-related measurements in their own homes. When com- 
bined with projects such as smart home solutions, it is possible to create rich 
datasets consisting of both health data and all kinds of behavioural data that can 
help tremendously in establishing a diagnosis, as well as getting a better under- 
standing of disease onset and development. 

There are, however, very strong privacy concerns in the healthcare sector that 
are likely to block many of these developments until they are resolved. Professor 
Marco Viceconti from the University of Sheffield outlined in his interview how 
certain recent developments such as k-anonymity can help protect privacy. A 
dataset has k-anonymity protection if the information for each individual in the 
dataset cannot be distinguished from at least k − 1 individuals whose information
also appears in the dataset (Sweeney 2002). Professor Viceconti envisions a future 
system that can automatically protect privacy by serving as a membrane between a 
patient and an institute using the data, where data can flow both ways and all the 
necessary privacy policies and anonymization processes are executed automatically 
in between. Such a system would benefit both the patient, by providing a more 
accurate diagnosis, and the institute, by allowing research using real-world data. 
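
The k-anonymity property defined above can be checked with a few lines of code. In this sketch the check is made over a chosen set of quasi-identifier attributes, which is how Sweeney (2002) frames the definition; the attribute names and the function name are hypothetical.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values that occurs
    in the records is shared by at least k records."""
    groups = Counter(tuple(record[q] for q in quasi_identifiers)
                     for record in records)
    return all(size >= k for size in groups.values())
```

The function only verifies the property. Producing a k-anonymous dataset in the first place, for example by generalizing or suppressing values, is a separate step that is not shown here.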


5.5.3 Retail 


O2 UK together with Telefónica Digital recently launched a service called 
Telefónica Dynamic Insights. This service takes all UK mobile data, including
location, timing of calls and texts, and also when customers move from one mast to 
another. This data is mapped and repurposed for the retail industry. The data is first 
anonymized, aggregated, and placed in the cloud. Then analytics are run which 
calculate where people live, where they work, and where they are in transit. If this 
data is then combined with anonymized customer relationship management (CRM) 
data, it can determine the type of people who pass by a particular shop at a specific 
time-point. It can also calculate the type of people who visit a shop, where they live, 
and where else they shop (termed catchment). 

This service supports real-estate management for retailers and contrasts well 
with present practice. Today, retailers hire students with clickers 
just to count the number of people who walk past the shop, leading to data that is far 
less detailed. The service is thus solving an existing problem in a new way. The 
service can be run on a weekly or daily basis and provides completely new business 
opportunities. In addition to retail, the service could be run in other sectors; for 
example, within the public sector it could analyse who walks past an underground 
station. Combining mobile data with preference data could open up new propositions 
for existing and new industries. This example is a taste of what is to come, the 
sum of which will definitely improve the competitiveness of European industry 
(Benjamins). 


5.5.4 Logistics 


In the United States, 45 % of fruits and vegetables reach the plate of the consumer, 
and in Europe 55 % do. Close to half of what is produced is lost. This 
is a big data problem: collecting data across the whole supply chain, analysing 
systems related to the distributed food, and identifying leaks and bottlenecks in the 
process would have an enormous impact. If implemented, this would give a better 
handle on prices and a fairer distribution of wealth among all the agents in the food 
supply chain. Big data technology is important and so is access to the right data and 
data sources (Bancilhon). 


5.5.5 Finance 


The World Bank is an organization that aims to end extreme poverty and promote 
shared prosperity. Their operations strongly rely on accurate information and they 
are using big data analytics to support their activities. They plan to organize 
competitions to drive the analytic capabilities to obtain an alternative measure for 
poverty and to detect financial corruption and fraud at an early stage. 

In terms of poverty, an important driver is to get more real-time estimates of 
poverty, which make it possible to make better short-term decisions. Three exam- 
ples of information sources that are currently being explored to obtain the infor- 
mation needed are: (1) Twitter data can be used to look for indicators of social and 
economic well-being; (2) poverty maps can be merged with alternative data sources 
such as satellite imagery to identify paved roads and support decisions in micro 
financing; and (3) web data can be scraped to get pricing data from supermarkets 
that help in poverty estimation. 

Corruption is currently dealt with reactively, meaning actions are only taken 
once corruption has been reported to the World Bank. On average only 30 % of the 
money is retrieved in corruption cases when dealt with reactively. Big data 
analytics will make more proactive approaches feasible, resulting in higher returns. 
This requires creating richer profiles of the companies and the partners that they 
work with. Data mining this in-depth profile data together with other data sources 
would make it possible to identify risk-related patterns. 

Overall, it is important for the World Bank to be able to make decisions, move 
resources, and make investment options available as fast as possible through the 
right people at the right time. Doing this based on limited sets of old data is not 
sustainable in the medium to long term. Accurate and real-time information is 
critical during the decision-making process. For example, if there is a recession 
looming, one needs to respond before it happens. If a natural disaster occurs, 
making decisions based on data available directly from the field rather than a 
3-year-old dataset is highly desirable (Das). 


5.6 Conclusions 


Big data analysis is a fundamental part of the big data value chain. Borrowing an 
old English saying, we can caricature what this component achieves as “turning 
lead into gold”. Large volumes of data, which may be heterogeneous with respect to 
encoding mechanism, format, structure, underlying semantics, provenance, 
reliability, and quality, are turned into data that is usable. 

As such, big data analysis comprises a collection of techniques and tools, some of 
which are old mechanisms recast to face the challenges raised by the three Vs 
(e.g. large-scale reasoning) and some of which are new (e.g. stream reasoning). 

The insights gathered on big data analysis presented here are based upon 
19 interviews with leading players in large and small industries and visionaries 
from Europe and the United States. The choice was taken to interview senior staff 
members who have a leadership role in large multinationals, technologists who 
work at the coalface with big data, founders and CEOs of the new breed of SMEs 
that are already producing value from big data, and academic leaders in the field. 

From our analysis it is clear that delivering highly scalable data analysis and 
reasoning mechanisms that are associated with an ecosystem of accessible and 
usable tools will produce significant benefits for Europe. The impact will be both 
economic and social. Current business models and processes will be radically 
transformed for economic and social benefit. The case study of reducing the amount 
of food wasted within the global food production life cycle is a prime example of 
this type of potential for big data. 

To summarize, big data analysis is an essential part of the overall big data value 
chain which promises to have significant economic and social impact in the 
European Union in the near to medium term. Without big data analysis the rest of 
the chain does not function. As one of our interviewees stated in a recent discussion 
on the relationship between data analysis and data analytics: 

Analytics without data is worthless. Analytics with bad data is dangerous. Analytics with 
good data is the objective. 


We wholeheartedly agree. 


Chapter 6 
Big Data Curation 


André Freitas and Edward Curry 


6.1 Introduction 


One of the key principles of data analytics is that the quality of the analysis is 
dependent on the quality of the information analysed. Gartner estimates that more 
than 25 % of critical data in the world’s top companies is flawed (Gartner 2007). 
Data quality issues can have a significant impact on business operations, especially 
when it comes to the decision-making processes within organizations (Curry 
et al. 2010). 

The emergence of new platforms for decentralized data creation such as sensor 
and mobile platforms, the increasing availability of open data on the web (Howe 
et al. 2008), and the increase in the number of data sources inside organizations 
(Brodie and Liu 2010) bring an unprecedented volume of data to be managed. In 
addition to the data volume, data consumers in the big data era need to cope with 
data variety, as a consequence of the decentralized data generation, where data is 
created under different contexts and requirements. Consuming third-party data 
comes with the intrinsic cost of repurposing, adapting, and ensuring data quality 
for its new context. 

Data curation provides the methodological and technological data management 
support to address data quality issues, maximizing the usability of the data. 
According to Cragin et al. (2007), “Data curation is the active and on-going 
management of data through its lifecycle of interest and usefulness; ... curation 
activities enable data discovery and retrieval, maintain quality, add value, and 
provide for re-use over time”. Data curation emerges as a key data management 
process where there is an increase in the number of data sources and platforms for 
data generation. 

Fig. 6.1 Data curation in the big data value chain 


The position of big data curation within the overall big data value chain can be 
seen in Fig. 6.1. Data curation processes can be categorized into different activities 
such as content creation, selection, classification, transformation, validation, and 
preservation. The selection and implementation of a data curation process is a 
multi-dimensional problem, depending on the interaction between the incentives, 
economics, standards, and technological dimensions. This chapter analyses the 
data dynamics in which data curation is inserted, investigates future requirements 
and emerging trends for data curation, and briefly describes exemplar case studies. 


6.2 Key Insights for Big Data Curation 


eScience and eGovernment are the innovators while biomedical and media 
companies are the early adopters. The demand for data interoperability and 
reuse in eScience and the demand for effective transparency through open data 
in the context of eGovernment are driving data curation practices and technologies. 
These sectors play the roles of visionaries and innovators in the data curation 
technology adoption lifecycle. From the industry perspective, organizations in the 
biomedical space, such as pharmaceutical companies, play the role of early 
adopters, driven by the need to reduce the time-to-market and lower the costs of 
the drug discovery pipelines. Media companies are also early adopters, driven by 
the need to organize large unstructured data collections, to reduce the time to create 
new products by repurposing existing data, and to improve accessibility and visibility 
of information artefacts. 


The core impact of data curation is to enable more complete and high-quality 
data-driven models for knowledge organizations. More complete models support 
a larger number of answers through data analysis. Data curation practices and 
technologies will progressively become more present in contemporary data management 
environments, helping organizations and individuals to reuse third-party 
data in different contexts and reducing the barriers for generating content with 
high data quality. The ability to efficiently cope with data quality and heterogeneity 
issues at scale will support data consumers in the creation of more sophisticated 
models, strongly impacting the productivity of knowledge-driven organizations. 


Data curation depends on the creation of an incentives structure. As an 
emergent activity, there is still vagueness and poor understanding of the role of 
data curation inside the big data lifecycle. In many projects the data curation costs 
are not estimated or are underestimated. The individuation and recognition of the 
data curator role and of data curation activities depend on realistic estimates of the 
costs associated with producing high-quality data. Funding boards can support this 
process by requiring an explicit estimate of the data curation resources on publicly 
funded projects with data deliverables and by requiring the publication of high-quality 
data. Additionally, the improvement of the tracking and recognition of data 
and infrastructure as a first-class scientific contribution is also a fundamental driver 
for methodological and technological innovation for data curation and for maximizing 
the return on investment and reusability of scientific outcomes. Similar 
recognition is needed within the enterprise context. 


Emerging economic models can support the creation of data curation infra- 
structures. Pre-competitive and public-private partnerships are emerging eco- 
nomic models that can support the creation of data curation infrastructures and 
the generation of high-quality data. Additionally, the justification for the investment 
in data curation infrastructures can be supported by a better quantification of the 
economic impact of high-quality data. 


Curation at scale depends on the interplay between automated curation plat- 
forms and collaborative approaches leveraging large pools of data curators. 
Improving the scale of data curation depends on reducing the cost per data curation 
task and increasing the pool of data curators. Hybrid human-algorithmic data 
curation approaches and the ability to compute the uncertainty of the results of 
algorithmic approaches are fundamental for improving the automation of complex 
curation tasks. Approaches for automating data curation tasks such as curation by 
demonstration can provide a significant increase in the scale of automation. 
Crowdsourcing also plays an important role in scaling-up data curation, allowing 
access to large pools of potential data curators. The improvement of crowdsourcing 
platforms towards more specialized, automated, reliable, and sophisticated plat- 
forms and the improvement of the integration between organizational systems and 
crowdsourcing platforms represent an exploitable opportunity in this area. 


The improvement of human-data interaction is fundamental for data 
curation. Improving approaches in which curators can interact with data impacts 
curation efficiency and reduces the barriers for domain experts and casual users to 
curate data. Examples of key functionalities in human-data interaction include 
natural language interfaces, semantic search, data summarization and visualization, 
and intuitive data transformation interfaces. 


Data-level trust and permission management mechanisms are fundamental 
to supporting data management infrastructures for data curation. Provenance 
management is a key enabler of trust for data curation, providing curators with the 
context to select data that they consider trustworthy and allowing them to capture 
their data curation decisions. Data curation also depends on mechanisms to assign 
permissions and digital rights at the data level. 
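
One way to picture data-level provenance and permissions is to attach both to each curated item. The Python sketch below is a minimal illustration under stated assumptions: the class names `ProvenanceEntry` and `CuratedItem`, their fields, and the simple set of permitted editors are all hypothetical and do not correspond to any platform discussed in this chapter.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEntry:
    """One recorded curation decision (hypothetical structure)."""
    curator: str
    action: str
    source: str
    timestamp: datetime


@dataclass
class CuratedItem:
    """A single data value carrying its own permissions and history."""
    value: str
    editors: set                      # curators allowed to change this item
    history: list = field(default_factory=list)

    def update(self, new_value, curator, source):
        # Permission is checked at the level of the individual item.
        if curator not in self.editors:
            raise PermissionError(f"{curator} may not edit this item")
        # Record who changed what, and on the basis of which source.
        self.history.append(
            ProvenanceEntry(
                curator=curator,
                action=f"changed {self.value!r} to {new_value!r}",
                source=source,
                timestamp=datetime.now(timezone.utc),
            )
        )
        self.value = new_value


item = CuratedItem(value="Dubln", editors={"alice"})
item.update("Dublin", curator="alice", source="national gazetteer")
```

A later curator can inspect `item.history` to judge whether the current value is trustworthy, which is the role the text assigns to provenance management.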


Data and conceptual model standards strongly reduce the data curation 
effort. A standards-based data representation reduces syntactic and semantic het- 
erogeneity, improving interoperability. Data model and conceptual model standards 
(e.g. vocabularies and ontologies) are available in different domains. However, 
their adoption is still growing. 


There is a need for improved theoretical models and methodologies for data 
curation activities. Theoretical models and methodologies for data curation 
should concentrate on supporting the transportability of the generated data under 
different contexts, facilitating the detection of data quality issues and improving the 
automation of data curation workflows. 


Better integration between algorithmic and human computation approaches is 
required. The growing maturity of data-driven statistical techniques in fields such 
as Natural Language Processing (NLP) and Machine Learning (ML) is shifting their 
use from academic to industry environments. Many NLP and ML tools have 
uncertainty levels associated with their results and are dependent on training over 
large datasets. Better integration between statistical approaches and human com- 
putation platforms is essential to allow the continuous evolution of statistical 
models by the provision of additional training data and also to minimize the impact 
of errors in the results. 
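
A common way to combine the two is to let the statistical model handle the cases it is confident about and to send the rest to people. The Python sketch below illustrates that idea only; the `Prediction` structure, the 0.9 threshold, and the function names are assumptions chosen for the example, not a description of a specific tool.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    item_id: str
    label: str
    confidence: float  # the model's own estimate, between 0.0 and 1.0


def route(predictions, threshold=0.9):
    """Split predictions into auto-accepted ones and ones needing human review."""
    accepted, needs_review = [], []
    for prediction in predictions:
        if prediction.confidence >= threshold:
            accepted.append(prediction)
        else:
            needs_review.append(prediction)
    return accepted, needs_review


def training_examples(needs_review, human_labels):
    """Turn human decisions into (item_id, label) pairs for retraining.

    `human_labels` maps item_id to the label chosen by a curator; items
    that nobody has reviewed yet are skipped.
    """
    return [
        (prediction.item_id, human_labels[prediction.item_id])
        for prediction in needs_review
        if prediction.item_id in human_labels
    ]


predictions = [
    Prediction("a", "person", 0.97),
    Prediction("b", "place", 0.55),
    Prediction("c", "person", 0.90),
]
accepted, needs_review = route(predictions)
new_examples = training_examples(needs_review, {"b": "organization"})
```

The human answers serve two purposes here: they correct the uncertain results, and they become additional training data, which is the continuous evolution of the model that the paragraph above calls for. Choosing the threshold is a cost trade-off between human effort and error rate.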


6.3 Emerging Requirements for Big Data Curation 


Many big data scenarios are associated with reusing and integrating data from a 
number of different data sources. This perception is recurrent across data curation 
experts and practitioners and it is reflected in statements such as: “a lot of big data is 
a lot of small data put together”, “most of big data is not a uniform big block”, “each 
data piece is very small and very messy, and a lot of what we are doing there is 
dealing with that variety” (Data Curation Interview: Paul Groth 2014). 

Reusing data that was generated under different requirements comes with the 
intrinsic price of coping with data quality and data heterogeneity issues. Data can 
be incomplete or may need to be transformed in order to be rendered useful. Kevin 
Ashley, director of the Digital Curation Centre, summarizes the mind-set behind data 
reuse: “... [it is] when you simply use what is there, which may not be what you 
would have collected in an ideal world, but you may be able to derive some useful 
knowledge from it” (Kevin Ashley 2014). In this context, data shifts from a 
resource that is tailored from the start to a certain purpose, to a raw material that 
will need to be repurposed in different contexts in order to satisfy a particular 
requirement. 

In this scenario data curation emerges as a key data management activity. Data 
curation can be seen from a data generation perspective (curation at source), where 
data is represented in a way that maximizes its quality in different contexts. Experts 
emphasize this as an important aspect of data curation: From the data science 
aspect, methodologies are needed to describe data so that it is actually reusable 
outside its original context (Kevin Ashley 2014). This points to the demand to 
investigate approaches which maximize the quality of the data in multiple contexts 
with a minimum curation effort: “we are going to curate data in a way that makes it 
usable ideally for any question that somebody might try to ask the data” (Kevin 
Ashley 2014). Data curation can also be done on the data consumption side, where 
data resources are selected and transformed to fit the requirements of the data 
consumer. 

Data curation activities are heavily dependent on the challenges of scale, in 
particular data variety, that emerge in the big data context. James Cheney, research 
fellow at the University of Edinburgh, observes “Big Data seems to be about 
addressing challenges of scale, in terms of how fast things are coming out at you 
versus how much it costs to get value out of what you already have”. Coping with 
data variety can be costly even for smaller amounts of data: “you can have Big Data 
challenges not only because you have Petabytes of data but because data is 
incredibly varied and therefore consumes a lot of resources to make sense of it”. 

While in the big data context the expression data variety is used to express the 
data management trend of coping with data from different sources, the concepts of 
data quality (Wang and Strong 1996; Knight and Burn 2005) and data heteroge- 
neity (Sheth 1999) have been well established in the database literature and provide 
a precise ground for understanding the tasks involved in data curation. 

Despite the fact that data heterogeneity and data quality were concerns already 
present before the big data scale era (Wang and Strong 1996; Knight and Burn 
2005), they become more prevalent in data management tasks with the growth in 
the number of data sources. This growth brought the need to define principles and 
scalable approaches for coping with data quality issues. It also brought data 
curation from a niche activity, restricted to a small community of scientists and 
analysts with high data quality standards, to a routine data management activity, 
which will progressively become more present within the average data management 
environment. 

The growth in the number of data sources and the scope of databases defines a long 
tail of data variety (Curry and Freitas 2014). Traditional relational data management 
environments were focused on data that mapped to frequent business processes and 
were regular enough to fit into a relational model. The long tail of data variety (see 
Fig. 6.2) expresses the shift towards expanding the data coverage of data management 
environments towards data that is less frequently used, more decentralized, and less 
structured. The long tail allows data consumers to have a more comprehensive model 
of their domain that can be searched, queried, analysed, and navigated. 

Fig. 6.2 The long tail of data curation and the scalability of data curation activities 

The central challenge of data curation models in the big data era is to deal with 
the long tail of data and to improve data curation scalability, by reducing the cost of 
data curation and increasing the number of data curators (Fig. 6.2), allowing data 
curation tasks to be addressed under limited time constraints. 

Scaling up data curation is a multidisciplinary problem that requires the devel- 
opment of economic models, social structures, incentive models, and standards, in 
coordination with technological solutions. The connection between these dimen- 
sions and data curation scalability is at the centre of the future requirements and 
future trends for data curation. 


6.4 Social and Economic Impact of Big Data Curation 


The growing availability of data brings the opportunity for people to use them to 
inform their decision-making process, allowing data consumers to have a more 
complete data-supported picture of reality. While some big data use cases are based 
on large scale but small schema and regular datasets, other decision-making 
scenarios depend on the integration of complex, multi-domain, and distributed 
data. The extraction of value from information coming from different data sources 
is dependent on the feasibility of integrating and analysing these data sources. 

Decision-makers can range from molecular biologists to government officials or 
marketing professionals and they have in common the need to discover patterns and 
create models to address a specific task or a business objective. These models need 
to be supported by quantitative evidence. While unstructured data (such as text 
resources) can support the decision-making process, structured data provides users 
greater analytical capabilities, by defining a structured representation associated 
with the data. This allows users to compare, aggregate, and transform data. With 
more data available, the barrier of data acquisition is reduced. However, to extract 
value from it, data needs to be systematically processed, transformed, and 
repurposed into a new context. 

Areas that depend on the representation of multi-domain and complex models 
are leading the data curation technology lifecycle. eScience projects lead the 
experimentation and innovation on data curation and are driven by the need to 
create infrastructures for improving reproducibility and large-scale multidis- 
ciplinary collaboration in science. They play the role of visionaries in the technol- 
ogy adoption lifecycle for advanced data curation technologies (see Use Cases 
Section). 

In the early adopter phase of the lifecycle, the biomedical industry (in particular, 
the pharmaceutical industry) is the main player, driven by the need to reduce the 
costs and time-to-market of drug discovery pipelines (Data Curation Interview: 
Nick Lynch 2014). For pharmaceutical companies data curation is central to 
organizational data management and third-party data integration. Following a 
different set of requirements, the media industry is also positioned as an early 
adopter, using data curation pipelines to classify large collections of unstructured 
resources (text and video), improving the data consumption experience through 
better accessibility and maximizing its reuse under different contexts. The third 
major early adopters are governments, targeting transparency through open data 
projects (Shadbolt et al. 2012). 

Data curation enables the extraction of value from data, and it is a capability that 
is required for areas that are dependent on complex and/or continuous data inte- 
gration and classification. The improvement of data curation tools and methods 
directly provides greater efficiency of the knowledge discovery process, maximizes 
return on investment per data item through reuse, and improves organizational 
transparency. 


6.5 Big Data Curation State of the Art 


This section briefly describes the widely adopted technologies and established 
approaches for data curation, while the next section focuses 
on the future requirements and emerging approaches. 


Master Data Management is composed of the processes and tools that support a 
single point of reference for the data of an organization, an authoritative data 
source. Master Data Management (MDM) tools can be used to remove duplicates 
and standardize data syntax, as an authoritative source of master data. MDM 
focuses on ensuring that an organization does not use multiple and inconsistent 
versions of the same master data in different parts of its systems. Processes in MDM 
include source identification, data transformation, normalization, rule administra- 
tion, error detection and correction, data consolidation, data storage, classification, 
taxonomy services, schema mapping, and semantic enrichment. 
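
Two of the MDM steps named above, normalization and data consolidation, can be illustrated with a small Python sketch. It is deliberately simplistic and rests on assumptions made for the example: records are dicts with hypothetical `name` and `email` fields, and two records are treated as the same master entity when their normalized e-mail addresses match. Production MDM tools use far richer matching rules.

```python
def normalize(record):
    """Standardize the syntax of one record (hypothetical fields)."""
    return {
        # Collapse repeated whitespace and use consistent capitalization.
        "name": " ".join(record["name"].split()).title(),
        # E-mail addresses are compared case-insensitively here.
        "email": record["email"].strip().lower(),
    }


def consolidate(records):
    """Build a single point of reference: one master record per e-mail."""
    master = {}
    for record in records:
        clean = normalize(record)
        # Keep the first version seen; later duplicates are dropped.
        master.setdefault(clean["email"], clean)
    return list(master.values())


customers = [
    {"name": "  ada   LOVELACE", "email": "Ada@Example.org "},
    {"name": "Ada Lovelace", "email": "ada@example.org"},
    {"name": "alan turing", "email": "alan@example.org"},
]
master_data = consolidate(customers)
```

Here the first two input records collapse into one master record, so the rest of the organization sees a single consistent version of that customer.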

Master data management is highly associated with data quality. According to 
Morris and Vesset (2005), the three main objectives of MDM are: 


1. Synchronizing master data across multiple instances of an enterprise application 

2. Coordinating master data management during an application migration 

3. Compliance and performance management reporting across multiple analytic 
systems 


Rowe (2012) provides an analysis of how 163 organizations implement MDM 
and its business impact. 


Curation at Source. Sheer curation or curation-at-source is an approach to curate 
data where lightweight curation activities are integrated into the normal workflow 
of those creating and managing data and other digital assets (Curry et al. 2010). 
Sheer curation activities can include lightweight categorization and normalization 
activities. An example would be vetting or “rating” the results of a categorization 
process performed by a curation algorithm. Sheer curation activities can also be 
composed with other curation activities, allowing more immediate access to curated 
data while also ensuring the quality control that is only possible with an expert 
curation team. 

The following are the high-level objectives of sheer curation described by 
Hedges and Blanke (2012): 


• Avoid data deposit by integrating with normal workflow tools 
• Capture provenance information of the workflow 
• Seamless interfacing with data curation infrastructure 


Crowdsourcing. Data curation can be a resource-intensive and complex task, 
which can easily exceed the capacity of a single individual. Most non-trivial data 
curation efforts are dependent on a collective data curation set-up, where participants 
are able to share the costs, risks, and technical challenges. Depending on the 
domain, data scale, and type of curation activity, data curation efforts can utilize 
relevant communities through invitation or crowds (Doan et al. 2011). These 
systems can range from systems with a large and open participation base such as 
Wikipedia (crowds-based) to systems with more restricted domain expert groups, 
such as Chemspider. 

The notion of “wisdom of crowds” advocates that potentially large groups of 
non-experts can solve complex problems usually considered to be solvable only by 
experts (Surowiecki 2005). Crowdsourcing has emerged as a powerful paradigm for 
outsourcing work at scale with the help of people online (Doan et al. 2011). 
Crowdsourcing has been fuelled by the rapid development in web technologies 
that facilitate contributions from millions of online users. The underlying assumption 
is that large-scale and cheap labour can be acquired on the web. The effectiveness 
of crowdsourcing has been demonstrated through websites like 
Wikipedia,¹ Amazon Mechanical Turk,² and Kaggle.³ Wikipedia follows a volunteer 
crowdsourcing approach where the general public is asked to contribute to the 
encyclopaedia creation project for the benefit of everyone (Kittur et al. 2007). 
Amazon Mechanical Turk provides a labour market for crowdsourcing tasks 
in exchange for money (Ipeirotis 2010). Kaggle enables organizations to publish problems 
to be solved through a competition between participants for a predefined 
reward. Although different in terms of incentive models, all these websites allow 
access to large numbers of workers, therefore, enabling their use as recruitment 
platforms for human computation (Law and von Ahn 2011). 

General-purpose crowdsourcing service platforms such as CrowdFlower 
(CrowdFlower Whitepaper 2012) or Amazon Mechanical Turk (Ipeirotis 2010) 
allow projects to route tasks to a paid crowd. The user of the service is abstracted 
from the effort of gathering the crowd and offers its tasks for a price in a market of 
crowd-workers. Crowdsourcing service platforms provide a flexible model and can 
be used to address ad hoc small-scale data curation tasks (such as a simple 
classification of thousands of images for a research project), peak data curation 
volumes (e.g. mapping and translating data in an emergency response situation), or 
regular curation volumes (e.g. continuous data curation for a company). 
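
When the same task, such as classifying an image, is given to several crowd-workers, their answers have to be combined into one result. The Python sketch below shows plain majority voting with an agreement threshold. It is an illustrative assumption about how a project might aggregate answers, not the method of CrowdFlower or Amazon Mechanical Turk, and the function name and the 0.6 threshold are hypothetical.

```python
from collections import Counter


def aggregate_votes(votes, min_agreement=0.6):
    """Combine redundant crowd answers by majority vote.

    `votes` maps a task id to the list of labels given by different
    workers. For each task the result is (label, agreement); the label
    is None when too few workers agree, signalling that the task should
    be escalated or reissued.
    """
    results = {}
    for task_id, labels in votes.items():
        if not labels:
            # No answers yet: nothing to aggregate.
            results[task_id] = (None, 0.0)
            continue
        label, count = Counter(labels).most_common(1)[0]
        agreement = count / len(labels)
        results[task_id] = (
            label if agreement >= min_agreement else None,
            agreement,
        )
    return results


crowd_answers = {
    "img1": ["cat", "cat", "dog"],
    "img2": ["cat", "dog"],
}
aggregated = aggregate_votes(crowd_answers)
```

With these inputs, "img1" is resolved because two of three workers agree, while "img2" stays unresolved because the workers are split. Asking more workers per task raises reliability but also raises the price paid per curated item.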


Collaboration spaces such as Wiki platforms and Content Management Systems 
(CMSs) allow users to collaboratively create and curate unstructured and structured 
data. While CMSs focus on allowing smaller and more restricted groups to 
collaboratively edit and publish online content (such as news, blogs, and 
eCommerce platforms), Wikis have proven to scale to very large user bases. As 
of 2014, Wikipedia counted more than 4,000,000 articles and has a community with 
more than 130,000 active registered contributors. 


¹ “Wikipedia” 2005. 12 Feb 2014. https://www.wikipedia.org/ 
² “Amazon Mechanical Turk” 2007. 12 Feb 2014. https://www.mturk.com/ 
³ “Kaggle: Go from Big Data to Big Analytics” 2005. 12 Feb 2014. http://www.kaggle.com/ 

Wikipedia uses a wiki as its main system for content construction. Wikis were 
first proposed by Ward Cunningham in 1995 and allow users to edit contents and 
collaborate on the web more efficiently. MediaWiki, the wiki platform behind 
Wikipedia, is already widely used as a collaborative environment inside organiza- 
tions. Important cases include Intellipedia, a deployment of the MediaWiki plat- 
form covering 16 U.S. Intelligence agencies, and Wiki Proteins, a collaborative 
environment for knowledge discovery and annotation (Mons et al. 2008). 

Wikipedia relies on a simple but highly effective way to coordinate its curation 
process, and accounts and roles are at the base of this system. All users are allowed 
to edit Wikipedia contents. Administrators, however, have additional permissions 
in the system (Curry et al. 2010). Most wiki and CMS platforms target 
unstructured and semi-structured data content, allowing users to classify and 
interlink unstructured content. 


6.5.1 Data Curation Platforms 


e Data Tamer: This prototype aims to replace the current developer-centric 
extract-transform-load (ETL) process with automated data integration. The 
system uses a suit of algorithms to automatically map schemas and 
de-duplicate entities. However, human experts and crowds are leveraged to 
verify integration updates that are particularly difficult for algorithms. 

• ZenCrowd: This system tries to address the problem of linking named entities in
text with a knowledge base. ZenCrowd bridges the gap between automated and
manual linking by improving the results of automated linking with human input.
The prototype was demonstrated for linking named entities in news articles with
entities in the linked open data cloud.

• CrowdDB: This database system answers SQL queries that cannot be answered
by a database management system or a search engine. In contrast to the exact
operations of a database, CrowdDB allows fuzzy operations with the help of
humans, for example, ranking items by relevance or comparing the equivalence of
images.

• Qurk: Although similar to CrowdDB, this system tries to improve the cost and
latency of human-powered sorts and joins. In this regard, Qurk applies
techniques such as batching, filtering, and output agreement.

• Wikipedia Bots: Wikipedia runs scheduled algorithms, known as bots, to assess
the quality of text articles. These bots also flag articles that require further
review by experts. SuggestBot recommends flagged articles to a Wikipedia editor
based on the editor's profile.


6.6 Future Requirements and Emerging Trends for Big Data Curation


This section provides a roadmap for data curation, based on a set of future
requirements for data curation and on the emerging approaches for coping with
them. Both the future requirements and the emerging approaches were collected
through an extensive analysis of the state of the art.


6.6.1 Future Requirements for Big Data Curation 


The list of future requirements was compiled by selecting and categorizing the
most recurrent demands in a state-of-the-art survey, together with those that
emerged in domain expert interviews as fundamental directions for the future of
data curation. Each requirement is categorized according to the following
attributes (Table 6.1):


• Core Requirement Dimensions: Consists of the main categories needed to
address the requirement. The dimensions are technical, social, incentive,
methodological, standardization, economic, and policy.

• Impact-level: Consists of the impact of the requirement for the data curation
field. By its construction, only requirements above a certain impact threshold are
listed. Possible values are medium, medium-high, high, very high.

• Affected areas: Lists the areas which are most impacted by the requirement.
Possible values are science, government, industry sectors (financial, health,
media and entertainment, telco, manufacturing), and environmental.

• Priority: Covers the level of priority that is associated with the requirement.
Possible values are: short-term (<3 years), medium-term (3-7 years), and
consolidation (>7 years).

• Core Actors: Covers the main actors that should be responsible for addressing
the core requirement. Core actors are government, industry, academia,
non-governmental organizations, and user communities.
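
The attribute scheme above can be read as a small record type. The following is a minimal sketch of one possible encoding, not the representation used by the authors: the class names, the numeric ordering of the impact levels, and the example entry are assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    TECHNICAL = "technical"
    SOCIAL = "social"
    INCENTIVE = "incentive"
    METHODOLOGICAL = "methodological"
    STANDARDIZATION = "standardization"
    ECONOMIC = "economic"
    POLICY = "policy"


class Impact(Enum):
    # The numeric values are an assumption; they only encode the ordering.
    MEDIUM = 1
    MEDIUM_HIGH = 2
    HIGH = 3
    VERY_HIGH = 4


class Priority(Enum):
    SHORT_TERM = "<3 years"
    MEDIUM_TERM = "3-7 years"
    CONSOLIDATION = ">7 years"


@dataclass(frozen=True)
class Requirement:
    name: str
    dimensions: frozenset
    impact: Impact
    affected_areas: frozenset
    priority: Priority
    core_actors: frozenset


def at_least(requirements, threshold):
    """Keep only the requirements whose impact level reaches the threshold."""
    return [r for r in requirements if r.impact.value >= threshold.value]


# Hypothetical entry, not taken from Table 6.1.
example = Requirement(
    name="Example requirement (hypothetical)",
    dimensions=frozenset({Dimension.TECHNICAL, Dimension.SOCIAL}),
    impact=Impact.HIGH,
    affected_areas=frozenset({"science", "government"}),
    priority=Priority.SHORT_TERM,
    core_actors=frozenset({"academia", "industry"}),
)
```

Encoding the impact level as an ordered value makes the "above a certain impact threshold" filter mentioned in the list a one-line comparison.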


6.6.2 Emerging Paradigms for Big Data Curation 


In the state-of-the-art analysis, key social, technical, and methodological
approaches emerged for addressing the future requirements. In this section, these
emerging approaches are described as well as their coverage in relation to the
category of requirements. Emerging approaches are defined as approaches that have
a limited adoption. These approaches are summarized in Table 6.2.


Table 6.2 Emerging approaches for addressing the future requirements

Requirement category | Emerging approach | Adoption/status | Exemplar use case
Incentives creation and social engagement mechanisms | Open and interoperable data policies | Early-stage/Limited adoption | Data.gov.uk
Incentives creation and social engagement mechanisms | Better recognition of the data curation role | Lacking adoption/Despite the exemplar use cases, the data curator role is still not recognized | Chemspider, Wikipedia, Protein Data Bank
Incentives creation and social engagement mechanisms | Attribution and recognition of data and infrastructure contributions | Standards emerging/Adoption missing | Altmetrics (Priem et al. 2010), ORCID
Incentives creation and social engagement mechanisms | Better understanding of social engagement mechanisms | Early-stage | GalaxyZoo (Forston et al. 2011), Foldit (Khatib et al. 2011)
Economic models | Pre-competitive partnerships | Seminal use cases | Pistoia Alliance (Barnes et al. 2009)
Economic models | Public-private partnerships | Seminal use cases | Geoconnections (Harper 2012)
Economic models | Quantification of the economic impact of data | Seminal use cases | Technopolis Group (2011) (“Data centres: their use, value and impact”)
Curation at scale | Human computation and Crowdsourcing services | Industry-level adoption/Services are available but there is space for market specialization | CrowdFlower, Amazon Mechanical Turk
Curation at scale | Evidence-based measurement models of uncertainty over data | Research stage | IBM Watson (Ferrucci et al. 2010)
Curation at scale | Programming by demonstration, induction of data transformation workflows | Research stage/Fundamental research areas are developed. Lack of applied research in a workflow and data curation context | Tuchinda et al. (2007), Tuchinda (2011)
Curation at scale | Curation at source | Existing use cases both in academic projects and industry | The New York Times
Curation at scale | General-purpose data curation pipelines | Available Infrastructure | OpenRefine, Karma, Scientific Workflow management systems
Curation at scale | Algorithmic validation/annotation | Early stage | Wikipedia, Chemspider
Focus ease of interactivity | Human-data interaction | Seminal tools available | OpenRefine
Focus ease of interactivity | Natural language interfaces, schema-agnostic queries | Research stage | IBM Watson (Ferrucci et al. 2010), Treo (Freitas and Curry 2014)
Trust | Capture of data curation decisions | Standards are in place, instrumentation of applications needed | OpenPhacts
Trust | Fine-grained permission management models and tools | Coarse-grained infrastructure available. | Qin and Atluri (2003), Ryutov et al. (2009), Kirrane et al. (2013), Rodriguez-Doncel et al. (2013)
Standardization and interoperability | Standardized data model | Standards are available | RDF(S), OWL
Standardization and interoperability | Reuse of vocabularies | Technologies for supporting vocabulary reuse is needed | Linked Open Data Web (Berners-Lee 2009)
Standardization and interoperability | Better integration and communication between tools | Low | N/A
Standardization and interoperability | Interoperable provenance representation | Standard in place/Standard adoption is still missing | W3C PROV
Curation models | Definition of minimum information models for data curation | Low adoption | MIRIAM (Laibe and Le Novère 2007)
Curation models | Nanopublications | Emerging concept | Mons and Velterop (2009), Groth et al. (2010)
Curation models | Investigation of theoretical principles and domain-specific models for data curation | Emerging concept | Pearl and Bareinboim (2011)
Unstructured-structured integration | NLP Pipelines | Tools are available, adoption is low | IBM Watson (Ferrucci et al. 2010)
Unstructured-structured integration | Entity recognition and alignment | Tools are available, adoption is low | DBpedia Spotlight (Mendes et al. 2011), IBM Watson (Ferrucci et al. 2010)


6.6.2.1 Social Incentives and Engagement Mechanisms 


Open and Interoperable Data Policies The demand for high-quality data is the
driver of the evolution of data curation platforms. The effort to produce and
maintain high-quality data needs to be supported by a solid incentives system,
which at this point in time is not fully in place. High-quality open data can be one of
the drivers of societal impact by supporting more efficient and reproducible science
(eScience) (Norris 2007) and more transparent and efficient governments
(eGovernment) (Shadbolt et al. 2012). These sectors play the roles of innovators and
early adopters in the data curation technology adoption lifecycle and are the main
drivers of innovation in data curation tools and methods. Funding agencies and
policy makers have a fundamental role in this process and should direct and support
scientists and government officials to make their data products available in an
interoperable way. The demand for high-quality and interoperable data can drive
the evolution of data curation methods and tools.


Attribution and Recognition of Data and Infrastructure Contributions From
the eScience perspective, scientific and editorial committees of prestigious
publications have the power to change the methodological landscape of scholarly
communication by emphasizing reproducibility in the review process and by
requiring publications to be supported by high-quality data when applicable.
From the scientist's perspective, publications supported by data can facilitate
reproducibility and avoid rework and, as a consequence, increase scientific
efficiency and the impact of scientific products. Additionally, as data becomes
more prevalent as a primary scientific product, it becomes a citable resource.
Mechanisms such as ORCID (Thomson Reuters Technical Report 2013) and Altmetrics
(Priem et al. 2010) already provide the supporting elements for identifying,
attributing, and quantifying the impact of outputs such as datasets and software.
The recognition of data and software contributions in academic evaluation systems
is a critical element for driving high-quality scientific data.


Better Recognition of the Data Curation Role The cost of publishing
high-quality data is not negligible and should be an explicit part of the
estimated costs of a project with a data deliverable. Additionally, the
methodological impact of data curation requires that the role of the data curator
be better recognized across the scientific and publishing pipeline. Some
organizations and projects already have a clear definition of different data
curator roles. Examples are Wikipedia, the New York Times (Curry et al. 2010),
and Chemspider (Pence and Williams 2010). The reader is referred to the case
studies to understand the activities of different data curation roles.


Better Understanding of Social Engagement Mechanisms While part of the
incentives structure may be triggered by public policies, or by direct financial gain,
other parts may emerge from the direct benefits of being part of a project that is
meaningful for a user community. Projects such as Wikipedia, GalaxyZoo (Forston
et al. 2011), or FoldIt (Khatib et al. 2011) have built large bases of volunteer
data curators by exploring different sets of incentive mechanisms, which can be
based on visibility and social or professional status, social impact,
meaningfulness, or fun. Understanding these principles and developing the
mechanisms behind the engagement of large user bases is an important issue for
amplifying data curation efforts.


6.6.2.2 Economic Models


Emerging economic models can provide the financial basis to support the genera- 
tion and maintenance of high-quality data and the associated data curation 
infrastructures. 


Pre-competitive Partnerships for Data Curation A pre-competitive collaboration
scheme is an economic model in which a consortium of organizations, which
are typically competitors, collaborate on parts of the Research & Development
(R&D) process that do not affect their commercial competitive advantage.
This allows partners to share the costs and risks associated with parts of the R&D
process. One case of this model is the Pistoia Alliance (Barnes et al. 2009), a
pre-competitive alliance of life science companies, vendors, publishers, and
academic groups that aims to lower barriers to innovation by improving the
interoperability of R&D business processes. The Pistoia Alliance was founded by
pharmaceutical companies such as AstraZeneca, GSK, Pfizer, and Novartis, and
examples of shared resources include data and data infrastructure tools.


Public-Private Data Partnerships for Curation Another emerging economic
model for data curation is the public-private partnership (PPP), in which private
companies and the public sector collaborate for mutual benefit.
In a PPP the risks, costs, and benefits are shared among the partners, which have
non-competing, complementary interests over the data. Geospatial data, with its
high impact on both the public (environmental, administration) and private
(natural resources companies) sectors, is one of the early cases of PPPs.
GeoConnections Canada is an example of a PPP initiative launched in 1999, with the
objective of developing the Canadian Geospatial Data Infrastructure (CGDI) and
publishing geospatial information on the web (Harper 2012; Data Curation
Interview: Joe Sewash 2014). GeoConnections has been developed on a collaborative
model involving the participation of federal, provincial, and territorial
agencies, and the private and academic sectors.


Quantification of the Economic Impact of Data The development of approaches
to quantify the economic impact, value creation, and associated costs behind data
resources is a fundamental element for justifying private and public investments in
data infrastructures. One exemplar case of value quantification is the JISC study
“Data centres: their use, value and impact” (Technopolis Group 2011), which
provides a quantitative account of the value creation process of eight data centres.
The creation of quantitative financial measures can provide the evidence required to
support both public and private investments in data infrastructure, creating
sustainable business models grounded in data assets and expanding the existing
data economy.


6.6.2.3 Curation at Scale


Human Computation and Crowdsourcing Services Crowdsourcing platforms 
are rapidly evolving but there is still a major opportunity for market differentiation 
and growth. CrowdFlower, for example, is evolving in the direction of providing 
better APIs, supporting better integration with external systems. 

Within crowdsourcing platforms, people show variability in the quality of work
they produce, as well as in the amount of time they take for the same work.
Additionally, the accuracy and latency of human processors are not uniform over
time. Therefore, appropriate methods are required to route tasks to the right
person at the right time (Hassan et al. 2012). Furthermore, combining work by
different people on the same task might also help in improving the quality of
work (Law and von Ahn 2009). Recruitment of suitable humans for computation is a
major challenge of human computation.
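
The two ideas in this paragraph, combining answers from several workers and routing a task to a suitable person, can be sketched as follows. This is an illustrative sketch only: majority voting is one simple aggregation rule among many, and the function names, the per-domain accuracy table, and the worker names are assumptions, not methods taken from the cited works.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and the fraction of workers who gave it."""
    if not answers:
        raise ValueError("no answers to aggregate")
    label, count = Counter(answers).most_common(1)[0]
    return label, count / len(answers)


def pick_worker(workers, task_domain):
    """Pick the worker with the highest estimated accuracy for a task domain.

    `workers` maps a worker name to a dict of {domain: accuracy estimate};
    a worker with no estimate for the domain is treated as 0.0.
    """
    return max(workers, key=lambda name: workers[name].get(task_domain, 0.0))


# Hypothetical accuracy estimates, e.g. derived from past tasks.
workers = {
    "ann": {"chemistry": 0.9},
    "bob": {"chemistry": 0.6, "news": 0.8},
}
label, agreement = majority_vote(["same", "same", "different"])
chosen = pick_worker(workers, "news")
```

The agreement fraction returned by `majority_vote` can serve as a rough signal for when a task needs additional judgements.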

Today, these platforms are mostly restricted to tasks that can be delegated to a 
paid generic audience. Possible future differentiation avenues include: (1) support 
for highly specialized domain experts, (2) more flexibility in the selection of 
demographic profiles, (3) creation of longer term (more persistent) relationships 
with teams of workers, (4) creation of a major general purpose open crowdsourcing 
service platform for voluntary work, and (5) using historical data to provide more 
productivity and automation for data curators (Kittur et al. 2007). 


Instrumenting Popular Applications for Data Curation In most cases data
curation is performed with common office applications: regular spreadsheets, text
editors, and email (Data Curation Interview: James Cheney 2014). These tools are
an intrinsic part of existing data curation infrastructures and users are familiar
with them. These tools, however, lack some of the functionalities which are
fundamental for data curation: (1) capture and representation of user actions;
(2) annotation mechanisms/vocabulary reuse; (3) ability to handle large-scale
data; (4) better search capabilities; and (5) integration with multiple data sources.

Extending applications with large user bases for data curation provides an 
opportunity for a low barrier penetration of data curation functionalities into 
more ad hoc data curation infrastructures. This allows wiring fundamental data 
curation processes into existing routine activities without a major disruption of the 
user working process (Data Curation Interview: Carole Goble 2014). 


General-Purpose Data Curation Pipelines While the adaptation and
instrumentation of regular tools can provide a low-cost generic data curation
solution, many projects will demand the use of tools designed from the start to
support more sophisticated data curation activities. The development of
general-purpose data curation frameworks that integrate core data curation
functionalities into a large-scale data curation platform is a fundamental
element for organizations that do large-scale data curation. Platforms such as
OpenRefine⁴ and Karma (Gil et al. 2011) provide examples of emerging data
curation frameworks, with a focus on data transformation and integration. Unlike
Extract Transform Load (ETL) frameworks, data curation platforms provide better
support for ad hoc, dynamic, manual, less frequent (long tail), and less scripted
data transformations and integration. ETL pipelines can be seen as concentrating
recurrent activities that become more formalized into a scripted process.
General-purpose data curation platforms should target domain experts, providing
tools that are usable by people without a computer science or information
technology background.


4 http://openrefine.org/


Algorithmic Validation/Annotation Another major direction for reducing the
cost of data curation is the automation of data curation activities.
Algorithms are becoming more intelligent with advances in machine learning and
artificial intelligence. It is expected that machine intelligence will be able to
validate, repair, and annotate data within seconds, tasks that might take humans
hours to perform (Kong et al. 2011). In effect, humans will be involved as
required, e.g. for defining curation rules, validating hard instances, or providing
data for training algorithms (Hassan et al. 2012).


The simplest form of automation consists of scripting curation activities that are
recurrent, creating specialized curation agents. This approach is used, for example,
in Wikipedia (Wiki Bots) for article cleaning and detecting vandalism. Another
automation process consists of providing an algorithmic approach for the validation
or annotation of the data against reference standards (Data Curation Interview:
Antony Williams 2014). This would contribute to a “likesonomy” where both
humans and algorithms could provide further evidence in favour of or against data
(Data Curation Interview: Antony Williams 2014). These approaches provide a way
to automate the more recurrent parts of the curation tasks and can be implemented
today in any curation pipeline (there are no major technological barriers). However,
the construction of these algorithmic or reference bases is costly (in terms of
time and expertise), since they depend on an explicit formalization of the
algorithm or the reference criteria (rules).
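
As an illustration of validation against explicitly formalized reference criteria, the sketch below checks records against a small rule set and flags failures for human review. The field names, the identifier pattern, and the rules themselves are hypothetical; a real curation pipeline would derive them from the reference standards of its own domain.

```python
import re

# Hypothetical reference rules: one predicate per field.
RULES = {
    "id": lambda v: isinstance(v, str) and re.fullmatch(r"[A-Z]{2}\d{4}", v) is not None,
    "mass": lambda v: isinstance(v, (int, float)) and v > 0,
    "name": lambda v: isinstance(v, str) and v.strip() != "",
}


def validate(record, rules=RULES):
    """Return the names of the fields that are missing or fail their rule."""
    return [
        field
        for field, check in rules.items()
        if field not in record or not check(record[field])
    ]


def flag_for_review(records, rules=RULES):
    """Pair every failing record with its failed fields, for human review."""
    flagged = []
    for record in records:
        failures = validate(record, rules)
        if failures:
            flagged.append((record, failures))
    return flagged
```

The cost noted in the text shows up directly here: every rule in `RULES` has to be written down and maintained by someone with domain expertise.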


Data Curation Automation More sophisticated automation approaches that could 
alleviate the need for the explicit formalization of curation activities will play a 
fundamental role in reducing the cost of data curation. There is significant potential 
for the application of machine learning in the data curation field. Two research 
areas that can impact data curation automation are: 


• Curating by Demonstration (CbD)/Induction of Data Curation Workflows:
Programming by example [or programming by demonstration (PbD)] (Cypher
1993; Flener and Schmid 2008; Lieberman 2001) is a set of end user development
approaches in which user actions on concrete instances are generalized into
a program. PbD can be used to allow distribution and amplification of the system
development tasks by allowing users to become programmers. Despite being a
traditional research area, and with research on PbD data integration (Tuchinda
et al. 2007, 2011), PbD methods have not been extensively applied to data
curation systems.

• Evidence-based Measurement Models of Uncertainty over Data: The
quantification and estimation of generic and domain-specific models of uncertainty
from distributed and heterogeneous evidence bases can provide the basis for the
decision on what should be delegated to or validated by humans and what can be
delegated to algorithmic approaches. IBM Watson is an example of a system that
uses at its centre a statistical model to determine the probability of an answer
being correct (Ferrucci et al. 2010). Uncertainty models can also be used to route
tasks according to the level of expertise, minimizing the cost and maximizing the
quality of data curation.
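
A minimal sketch of how an uncertainty estimate could drive the split between algorithmic and human curation is given below. It assumes that some upstream model already supplies, for each item, a probability that the automated result is correct; the thresholds and the function name are illustrative assumptions and are not taken from IBM Watson or any other cited system.

```python
def route(item_confidences, accept_above=0.9, reject_below=0.1):
    """Split items into automatic decisions and a human review queue.

    `item_confidences` maps an item identifier to the estimated probability
    that the automated curation result for that item is correct.
    """
    auto_accept, auto_reject, human_review = [], [], []
    for item, p in item_confidences.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"confidence for {item!r} is not a probability: {p}")
        if p >= accept_above:
            auto_accept.append(item)
        elif p <= reject_below:
            auto_reject.append(item)
        else:
            human_review.append(item)
    # Show the most uncertain items (closest to 0.5) to humans first.
    human_review.sort(key=lambda item: abs(item_confidences[item] - 0.5))
    return auto_accept, auto_reject, human_review
```

Tightening the two thresholds sends more items to humans and raises cost; loosening them does the opposite, which is the cost and quality trade-off described above.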


6.6.2.4 Human-Data Interaction


Interactivity and Ease of Curation Actions Data interaction approaches that
facilitate data transformation and access are fundamental for expanding the
spectrum of data curators’ profiles. There are still major barriers to interacting
with structured data, and the process of querying, analysing, and modifying data
inside databases is in most cases mediated by IT professionals or domain-specific
applications. Supporting domain experts and casual users in querying, navigating,
analysing, and transforming structured data is a fundamental functionality in data
curation platforms.


According to Carole Goble, “from a big data perspective, the challenges are around
finding the slices, views or ways into the dataset that enables you to find the bits that
need to be edited, changed” (Data Curation Interview: Carole Goble 2014).
Therefore, appropriate summarization and visualization of data are important not
only from the usage perspective but also from the maintenance perspective (Hey and
Trefethen 2004). Specifically, for the collaborative methods of data cleaning, it is
fundamental to enable the discovery of anomalies in both structured and
unstructured data. Additionally, making data management activities more mobile and
interactive is required as mobile devices overtake desktops. The following
technologies provide direction towards better interaction:


• Data-Driven Documents⁵ (D3.js): D3.js is a library for displaying interactive
graphs in web documents. This library adheres to open web standards such as
HTML5, SVG, and CSS to enable powerful visualizations with open source
licensing.

• Tableau⁶: This software allows users to visualize multiple dimensions of
relational databases. Furthermore, it enables visualization of unstructured data
through third-party adapters. Tableau has received a lot of attention due to its
ease of use and free access public plan.

• OpenRefine⁷: This open source application allows users to clean and transform
data from a variety of formats such as CSV, XML, RDF, and JSON. OpenRefine
is particularly useful for finding outliers in data and checking the distribution of
values in columns through facets. It allows data reconciliation with external data
sources such as Freebase and OpenCorporates.⁸


5 http://d3js.org/
6 http://www.tableausoftware.com/public/


Structured query languages such as SQL are the default approach for interacting
with databases, together with graphical user interfaces that are developed as a
façade over structured query languages. The query language syntax and the need
to understand the schema of the database make this approach ill-suited to domain
experts who want to interact with and explore the data. Querying progressively
more complex structured databases and dataspaces will demand different approaches
suitable for different tasks and different levels of expertise (Franklin et al.
2005). New approaches for interacting with structured data have evolved from the
early research stage and can provide the basis for new suites of tools that can
facilitate the interaction between user and data. Examples are keyword search,
visual query interfaces, and natural language query interfaces over databases
(Franklin et al. 2005; Freitas et al. 2012a, b; Kaufmann and Bernstein 2007).
Flexible approaches for database querying depend on the ability of the approach to
interpret the user query intent, matching it with the elements in the database.
These approaches are ultimately dependent on the creation of semantic models that
support semantic approximation (Freitas et al. 2011). Despite going beyond the
proof-of-concept stage, these functionalities and approaches have not migrated to
commercial-level applications.
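
To make the idea of matching query terms to database elements concrete, the sketch below maps the keywords of a free-text query onto table and column names using plain string similarity from the Python standard library. This is a deliberately naive stand-in: the approaches cited above rely on semantic models rather than string similarity, and the schema and function name are hypothetical.

```python
import difflib

# Hypothetical schema: table name -> column names.
SCHEMA = {
    "customers": ["customer_id", "name", "country"],
    "orders": ["order_id", "customer_id", "order_date", "total"],
}


def match_keywords(query, schema=SCHEMA, cutoff=0.6):
    """Map each query keyword to the most similar table or column name.

    Keywords with no candidate above the similarity cutoff are dropped.
    A column name shared by several tables resolves to the first table listed.
    """
    candidates = {}
    for table, columns in schema.items():
        candidates[table] = table
        for column in columns:
            candidates.setdefault(column, f"{table}.{column}")
    matches = {}
    for word in query.lower().split():
        close = difflib.get_close_matches(word, list(candidates), n=1, cutoff=cutoff)
        if close:
            matches[word] = candidates[close[0]]
    return matches
```

String similarity only bridges spelling variation; bridging vocabulary differences between the user and the schema is where the semantic approximation discussed above becomes necessary.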


6.6.2.5 Trust 


Provenance Management As data reuse grows, the consumer of third-party data
needs to have mechanisms in place to verify the trustworthiness and the quality of
the data. Some of the data quality attributes can be evident from the data itself,
while others depend on an understanding of the broader context behind the data,
i.e. the provenance of the data: the processes, artefacts, and actors behind the
data creation.

Capturing and representing the context in which the data was generated and
transformed, and making it available to data consumers, is a major requirement
for curating datasets targeted at third-party consumers. Provenance
standards such as W3C PROV⁹ provide the grounding for the interoperable
representation of the data. However, data curation applications still need to be
instrumented to capture provenance. Provenance can be used to explicitly capture
and represent the curation decisions that are made (Data Curation Interview: Paul
Groth 2014). However, there is still relatively low adoption of provenance
capture and management in data applications. Additionally, manually evaluating
trust and quality from provenance data can be a time-consuming process. The
representation of provenance needs to be complemented by automated approaches
to derive trust and assess data quality from provenance metadata, in the context
of a specific application.


7 https://github.com/OpenRefine/OpenRefine/wiki
8 https://www.opencorporates.com
9 http://www.w3.org/TR/prov-primer/
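
The kind of instrumentation discussed here can be sketched as a small log of curation decisions. The record below is loosely modelled on the entity, activity, and agent concepts of W3C PROV, but it is not a PROV serialization: the class names, fields, and identifiers are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CurationActivity:
    """One curation decision, in the spirit of a PROV activity."""
    activity_id: str
    agent: str        # who made the decision (cf. prov:wasAssociatedWith)
    used: str         # identifier of the input entity (cf. prov:used)
    generated: str    # identifier of the output entity (cf. prov:wasGeneratedBy)
    rationale: str    # free-text reason for the decision
    ended_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ProvenanceLog:
    def __init__(self):
        self._activities = []

    def record(self, activity):
        self._activities.append(activity)

    def lineage(self, entity_id):
        """Walk back from an entity through the activities that produced it."""
        by_output = {a.generated: a for a in self._activities}
        chain, seen = [], set()
        current = entity_id
        while current in by_output and current not in seen:
            seen.add(current)  # guards against accidental cycles in the log
            activity = by_output[current]
            chain.append(activity)
            current = activity.used
        return chain
```

A consumer of the final dataset can call `lineage` to see which agents touched the data and why, which is the raw material for the automated trust assessment mentioned above.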


Fine-Grained Permission Management Models and Tools Allowing large 
groups of users to collaborate demands the creation of fine-grained permission/rights 
associated with curation roles. Most systems today have a coarse-grained permission 
system, where system stewards oversee general contributors. While this mechanism 
can fully address the requirements of some projects, there is a clear demand for more 
fine-grained permission systems, where permissions can be defined at a data item 
level (Qin and Atluri 2003; Ryutov et al. 2009) and can be assigned in a distributed 
way. In order to support this fine-grained control, the investigation and development 
of automated methods for permissions inference and propagation (Kirrane 
et al. 2013), as well as low-effort distributed permission assignment mechanisms, 
is of primary importance. Analogously, similar methods can be applied to a fine- 
grained control of digital rights (Rodriguez-Doncel et al. 2013). 
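
A minimal sketch of item-level permissions with propagation is shown below: a grant on a dataset is inherited by the items it contains, while a grant on a single item stays local to that item. The class, the resource naming scheme, and the inheritance rule are illustrative assumptions and do not reproduce the models of the cited works.

```python
class PermissionStore:
    """Grants are stored per (user, resource); resources inherit from a parent."""

    def __init__(self):
        self._parent = {}   # resource -> parent resource
        self._grants = {}   # (user, resource) -> set of allowed actions

    def set_parent(self, resource, parent):
        self._parent[resource] = parent

    def grant(self, user, resource, action):
        self._grants.setdefault((user, resource), set()).add(action)

    def is_allowed(self, user, resource, action):
        """Check the resource itself, then each ancestor in turn."""
        seen = set()
        current = resource
        while current is not None and current not in seen:
            seen.add(current)  # guards against a cyclic parent chain
            if action in self._grants.get((user, current), set()):
                return True
            current = self._parent.get(current)
        return False


store = PermissionStore()
store.set_parent("dataset/row1", "dataset")
store.grant("ann", "dataset", "read")        # coarse-grained grant
store.grant("bob", "dataset/row1", "edit")   # fine-grained grant on one item
```

Propagation keeps the number of explicit grants small, which matters when permissions are assigned by many users in a distributed way.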


6.6.2.6 Standardization and Interoperability 


Standardized Data Model and Vocabularies for Data Reuse A large part of the
data curation effort consists of integrating and repurposing data created under
different contexts. In many cases this integration can involve hundreds of data
sources. Data model standards such as the Resource Description Framework
(RDF)¹⁰ facilitate data integration at the data model level. The use of Uniform
Resource Identifiers (URIs) in the identification of data entities works as a
web-scale open foreign key, which promotes the reuse of identifiers across different
datasets, facilitating a distributed data integration process.
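
The "web-scale open foreign key" role of URIs can be illustrated with plain subject-predicate-object triples: two independently produced datasets merge without any schema mapping wherever they use the same URI for the same entity. The URIs, property names, and values below are hypothetical, and plain tuples stand in for a real RDF library.

```python
from collections import defaultdict

# Two hypothetical datasets that happen to share one entity URI.
DATASET_A = [
    ("http://example.org/entity/42", "label", "Item A"),
    ("http://example.org/entity/42", "category", "sample"),
]
DATASET_B = [
    ("http://example.org/entity/42", "source", "lab-notebook"),
    ("http://example.org/entity/7", "label", "Item B"),
]


def merge(*datasets):
    """Group all statements by subject URI, then by predicate."""
    merged = defaultdict(lambda: defaultdict(set))
    for triples in datasets:
        for subject, predicate, obj in triples:
            merged[subject][predicate].add(obj)
    return merged


merged = merge(DATASET_A, DATASET_B)
```

After the merge, the entry for `http://example.org/entity/42` carries statements from both datasets, joined purely on the shared identifier.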

The creation of terminologies and vocabularies is a critical methodological step
in a data curation project. Projects such as the New York Times (NYT) Index
(Curry et al. 2010) or the Protein Data Bank (PDB) (Bernstein et al. 1977) prioritize
the creation and evolution of a vocabulary that can serve to represent and annotate
the data domain. In the case of PDB, the vocabulary expresses the representation
needs of a community. The use of shared vocabularies is part of the vision of the
linked data web (Berners-Lee 2009) and it is one methodological tool that can be
used to facilitate semantic interoperability. While the creation of a vocabulary is
more related to a methodological dimension, semantic search, schema mapping, or
ontology alignment approaches (Shvaiko and Euzenat 2005; Freitas et al. 2012a, b)
are central to reducing the burden of manual vocabulary mapping on the end user
side, and thereby to lowering the barrier to terminological reuse (Freitas et al.
2012a, b).


10 http://www.w3.org/TR/rdf11-primer/


Improved Integration and Communication between Curation Tools Data is
created and curated in different contexts and using different tools (which are
specialized to satisfy different data curation needs). For example, a user may
analyse possible data inconsistencies with a visualization tool, do schema mapping
with a different tool, and then correct the data using a crowdsourcing platform. The
ability to move the data seamlessly between different tools and capture user
curation decisions and data transformations across different platforms is
fundamental to support more sophisticated data curation operations that may demand
highly specialized tools to make the final result trustworthy (Data Curation
Interview: Paul Groth 2014; Data Curation Interview: James Cheney 2014). The
creation of standardized data models and vocabularies (such as W3C PROV) addresses
part of the problem. However, data curation applications need to be adapted to
capture and manage provenance and to better adopt existing standards.


6.6.2.7 Data Curation Models 


Minimum Information Models for Data Curation Despite recent efforts towards the
recognition and understanding of the field of data curation (Palmer et al. 2013;
Lord et al. 2004), the processes behind it still need to be better formalized. The
adoption of methods such as minimum information models (Le Novère et al. 2005)
and their materialization in tools is one example of methodological improvement
that can provide a minimum quality standard for data curators. In eScience,
MIRIAM (minimum information required in the annotation of models) (Laibe
and Le Novère 2007) is an example of a community-level effort to standardize
the annotation and curation processes of quantitative models of biological systems.


Curating Nanopublications, Coping with the Long Tail of Science With the 
increase in the amount of scholarly communication, it is increasingly difficult to 
find, connect, and curate scientific statements (Mons and Velterop 2009; Groth 
et al. 2010). Nanopublications are core scientific statements with associated con- 
texts (Groth et al. 2010), which aim at providing a synthetic mechanism for 
scientific communication. Nanopublications are still an emerging paradigm, 
which may provide a way for the distributed creation of semi-structured data in 
both scientific and non-scientific domains. 


Investigation of Theoretical Principles and Domain-Specific Models Models 
for data curation should evolve from the ground practice into a more abstract 
description. The advancement of automated data curation algorithms will depend 
on the definition of theoretical models and on the investigation of the principles 
behind data curation (Buneman et al. 2008). Understanding the causal mechanisms 
behind workflows (Cheney 2010) and the generalization conditions behind data 
transportability (Pearl and Bareinboim 2011) are examples of theoretical models 
that can impact data curation, guiding users towards the generation and represen- 
tation of data that can be reused in broader contexts. 

6.6.2.8 Unstructured and Structured Data Integration 


Entity Recognition and Linking Most of the information on the web and in 
organizations is available as unstructured data (text, videos, etc.). Making 
sense of information available as unstructured data is time-consuming: 
unlike structured data, unstructured data cannot be directly compared, 
aggregated, or operated on. At the same time, unstructured data holds most of the 
information of the long tail of data variety (Fig. 6.2). 

Extracting structured information from unstructured data is a fundamental step 
for making the long tail of data analysable and interpretable. Part of the problem can 
be addressed by information extraction approaches (e.g. relation extraction, entity 
recognition, and ontology extraction) (Freitas et al. 2012a, b; Schutz and Buitelaar 
2005; Han et al. 2011; Data Curation Interview: Helen Lippell 2014). These tools 
extract information from text and can be used to automatically build semi-structured 
knowledge from it. Some information extraction frameworks are mature for 
certain classes of information extraction problems, but their adoption 
remains limited to early adopters (Curry et al. 2010; Data Curation Interview: Helen 
Lippell 2014). 


Use of Open Data to Integrate Structured and Unstructured Data Another 
recent shift in this area is the availability of large-scale structured data resources, in 
particular open data, which is supporting information extraction. For example, 
entities in open datasets such as DBpedia (Auer et al. 2007) and Freebase 
(Bollacker et al. 2008) can be used to identify named entities (people, places, and 
organizations) in texts, which can be used to categorize and organize text contents. 
Open data in this scenario works as a common-sense knowledge base for entities 
and can be extended with domain-specific entities inside organizational environ- 
ments. Named entity recognition and linking tools such as DBpedia Spotlight 
(Mendes et al. 2011) can be used to link structured and unstructured data. 
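
The idea of using open data as an entity dictionary can be illustrated with a minimal sketch. It is not how DBpedia Spotlight works internally; it only assumes a tiny hand-written gazetteer (the `GAZETTEER` table and the `link_entities` function are hypothetical names introduced here) whose entries map surface forms to DBpedia-style resource URIs.

```python
import re

# Hypothetical mini-gazetteer: surface form -> (entity type, DBpedia-style URI).
# A real system would load entity labels from an open dataset instead.
GAZETTEER = {
    "new york times": ("Organization", "http://dbpedia.org/resource/The_New_York_Times"),
    "dublin": ("Place", "http://dbpedia.org/resource/Dublin"),
    "tim berners-lee": ("Person", "http://dbpedia.org/resource/Tim_Berners-Lee"),
}


def link_entities(text):
    """Return (surface form, type, URI) for every gazetteer label found in text."""
    found = []
    lowered = text.lower()
    for label, (etype, uri) in GAZETTEER.items():
        # Whole-word, case-insensitive match of the label in the text.
        for match in re.finditer(r"\b" + re.escape(label) + r"\b", lowered):
            surface = text[match.start():match.end()]
            found.append((match.start(), surface, etype, uri))
    found.sort()  # report mentions in reading order
    return [(surface, etype, uri) for _, surface, etype, uri in found]


annotations = link_entities("Tim Berners-Lee gave a talk in Dublin.")
for surface, etype, uri in annotations:
    print(surface, etype, uri)
```

Each annotation ties a span of unstructured text to an identifier that also occurs in structured datasets, which is what allows the two kinds of data to be joined. Domain-specific entities could be supported by extending the dictionary.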

Complementarily, unstructured data can be used to provide a more comprehen- 
sive description for structured data, improving content accessibility and semantics. 
Distributional semantic models, semantic models that are built from large-scale 
collections (Freitas et al. 2012a, b), can be applied to structured databases (Freitas 
and Curry 2014) and are examples of approaches that can be used to enrich the 
semantics of the data. 
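
As a rough illustration of a distributional semantic model, the following sketch builds word vectors from co-occurrence counts in a toy corpus and compares them with cosine similarity. The corpus, the window size, and the function names are assumptions made for this example; the cited approaches rely on far larger collections and more refined weighting.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


corpus = [
    "the company hired a new employee",
    "the company hired a new worker",
    "the river flooded the old town",
]
vectors = cooccurrence_vectors(corpus)
sim_related = cosine(vectors["employee"], vectors["worker"])
sim_unrelated = cosine(vectors["employee"], vectors["river"])
print(sim_related, sim_unrelated)
```

Words that appear in similar contexts obtain similar vectors, so a query term such as "worker" could be matched to a database attribute named "employee" even though the strings differ.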


Natural Language Processing Pipelines The Natural Language Processing 
(NLP) community has mature approaches and tools that can be directly applied to 
projects that deal with unstructured data. Open source projects such as Apache 
UIMA¹¹ facilitate the integration of NLP functionalities into other systems. 
Additionally, strong industry use cases such as IBM Watson (Ferrucci et al. 2010), 
Thomson Reuters, The New York Times (Curry et al. 2010), and the Press 
Association (Data Curation Interview: Helen Lippell 2014) are shifting the perception of 
NLP techniques from the academic to the industrial field. 


11 http://uima.apache.org/ 


6.7 Sectors Case Studies for Big Data Curation 


In this section, case studies are discussed that cover different data curation pro- 
cesses over different domains. The purpose behind the case studies is to capture the 
different workflows that have been adopted or designed in order to deal with data 
curation in the big data context. 


6.7.1 Health and Life Sciences 
6.7.1.1 ChemSpider 


ChemSpider¹² is a search engine that provides free access to the structure-centric 
chemical community. It has been designed to aggregate and index chemical structures 
and their associated information into a single searchable repository. 
ChemSpider contains tens of millions of chemical compounds with associated 
data and serves as a data provider to websites and software tools. Available 
since 2007, ChemSpider has collated over 300 data sources from chemical vendors, 
government databases, private laboratories, and individuals. Used by chemists for 
identifier conversion and predictions, ChemSpider datasets are also heavily leveraged 
by chemical vendors and pharmaceutical companies as pre-competitive 
resources for experimental and clinical trial investigation. 

Data curation in ChemSpider consists of the manual annotation and correction of 
data (Pence and Williams 2010). This may include changes to the chemical 
structure of a compound, the addition or deletion of identifiers, the association of 
links between a chemical compound and its related data sources, etc. ChemSpider 
offers curators two ways to contribute: 


• Post comments on a record in order to highlight the need for appropriate action 
by a master curator. 

• As a registered member with curation rights, directly curate the data or remove 
erroneous data. 


ChemSpider adopts a meritocratic model for its curation activities. Normal 
curators are responsible for deposition, which is checked and verified by 
master curators. Normal curators in turn can be invited to become masters after 
some qualifying period of contribution. The platform has a blended human and 
computer-based curation process. Robotic curation uses algorithms for error 
correction and data validation at deposition time. 


12 http://www.chemspider.com 

ChemSpider uses a mixture of computational approaches to perform certain 
levels of data validation. It has built its own chemical data validation tool, 
called CVSP (chemical validation and standardization platform). CVSP 
helps chemists to check whether chemicals are validly represented and whether 
there are any data quality issues, so that those issues can be flagged easily 
and efficiently. 

Using the open community model, ChemSpider distributes its curation activity 
across its community using crowdsourcing to accommodate massive growth rates 
and quality issues. They use a wiki-like approach for people to interact with the 
data, so that they can annotate it, validate it, curate it, flag it, and delete 
it. ChemSpider is in the process of implementing an automated recognition system 
that will measure the contribution effort of curators through the data validation and 
engagement process. The contribution metrics can be publicly viewable and acces- 
sible through a central profile for the data curator. 


6.7.1.2 Protein Data Bank 


The Research Collaboratory for Structural Bioinformatics Protein Data Bank¹³ 
(RCSB PDB) is a group dedicated to improving the understanding of the functions 
of biological systems through the study of the 3D structure of biological 
macromolecules. The PDB has had over 300 million dataset downloads. 

A significant amount of the curation process at PDB consists of providing 
standardized vocabulary for describing the relationships between biological enti- 
ties, varying from organ tissue to the description of the molecular structure. The use 
of standardized vocabularies helps with the nomenclature used to describe protein 
and small molecule names and their descriptors present in the structure entry. The 
data curation process covers the identification and correction of inconsistencies 
over the 3D protein structure and experimental data. In order to implement a global 
hierarchical governance approach to the data curation workflow, PDB staff review 
and annotate each submitted entry before robotic curation checks for plausibility as 
part of the data deposition, processing, and distribution. The data curation effort is 
distributed across their sister sites. 

Robotic curation automates the data validation and verification. Human curators 
contribute to the definition of rules for the detection of inconsistencies. The curation 
process is also applied retrospectively: errors found in the data are 
corrected in the archives. Up-to-date versions of the datasets are 
released on a weekly basis to keep all sources consistent with the current standards 
and to ensure good data curation quality. 
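
The division of labour between human curators, who define rules, and robotic curation, which applies them to every entry, can be sketched as follows. The rules and field names below are hypothetical and chosen only for illustration; they do not reproduce the checks actually run at the PDB.

```python
def check_entry(entry, rules):
    """Apply every curator-defined rule; return the names of the rules that fail."""
    return [name for name, predicate in rules if not predicate(entry)]


# Hypothetical rules and field names, chosen only for illustration.
RULES = [
    ("has_title", lambda e: bool(e.get("title"))),
    ("resolution_positive", lambda e: e.get("resolution", 0) > 0),
    ("chain_count_matches", lambda e: e.get("chain_count") == len(e.get("chains", []))),
]

good = {"title": "Example structure", "resolution": 1.8, "chain_count": 2, "chains": ["A", "B"]}
bad = {"title": "", "resolution": -1.0, "chain_count": 3, "chains": ["A"]}

good_issues = check_entry(good, RULES)
bad_issues = check_entry(bad, RULES)
print(good_issues)
print(bad_issues)
```

Because the rules are plain data, a newly discovered class of inconsistency can be added as one more rule and then re-run over the whole archive, which matches the retrospective correction described above.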


13 http://www.pdb.org 


6.7.1.3 Foldit 


Foldit (Good and Su 2011) is a popular example of human computation applied to 
a complex problem, i.e. finding patterns of protein folding. The developers of Foldit 
have used gamification to enable human computation. Through these games people 
can predict protein structures that might help in targeting drugs at particular diseases. 
Current computer algorithms are unable to deal with the exponentially high number 
of possible protein structures. To overcome this problem, Foldit uses competitive 
protein folding to generate the best proteins (Eiben et al. 2012). 


6.7.2 Media and Entertainment 
6.7.2.1 Press Association 


Press Association (PA) is the national news agency for the UK and Ireland and a 
leading multimedia content provider across web, mobile, broadcast, and print. For 
the last 145 years, PA has been providing feeds (text, data, photos, and videos) to 
major UK media outlets as well as corporate customers and the public sector. 

The objective of data curation at Press Association is to select the most relevant 
information for its customers, classifying, enriching, and distributing it in ways that 
can be readily consumed. The curation process at Press Association employs a large 
number of curators in the content classification process, working over a large 
number of data sources. A curator inside PA is an analyst who collects, aggregates, 
classifies, normalizes, and analyses the raw information coming from different data 
sources. Since the nature of the information analysed is typically high volume and 
near real time, data curation is a big challenge inside the company and the use of 
automated tools plays an important role in this process. In the curation process, 
automatic tools provide a first level triage and classification, which is further refined 
by the intervention of human curators as shown in Fig. 6.3. 

The data curation process starts with an article submitted to a platform which 
uses a set of linguistic extraction rules over unstructured text to automatically 
derive tags for the article, enriching it with machine readable structured data. A 
data curator then selects the terms that better describe the contents and inserts new 
tags if necessary. The tags enrich the original text with the general category of the 
analysed contents, while also providing a description of specific entities (places, 
people, events, facts) that are present in the text. The metadata manager then 
reviews the classification and the content is published online. 
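
The workflow described above (automatic tag suggestion followed by human refinement) can be approximated in a few lines. This is a minimal sketch under stated assumptions: the regular-expression rules, tag names, and function names are invented for the example and are not the Press Association's actual rules or platform.

```python
import re

# Hypothetical extraction rules: tag -> regular expression that triggers it.
EXTRACTION_RULES = {
    "category:sport": re.compile(r"\b(match|league|goal)\b", re.IGNORECASE),
    "category:politics": re.compile(r"\b(election|parliament|minister)\b", re.IGNORECASE),
    "place:London": re.compile(r"\bLondon\b"),
}


def suggest_tags(article_text):
    """First-level triage: every tag whose rule matches the article text."""
    return sorted(tag for tag, pattern in EXTRACTION_RULES.items() if pattern.search(article_text))


def curate(suggested, accepted, added=()):
    """Keep only the suggestions the curator accepted, then append curator-added tags."""
    kept = [tag for tag in suggested if tag in accepted]
    return kept + [tag for tag in added if tag not in kept]


article = "The minister spoke in London after the election, then attended a league match."
suggested = suggest_tags(article)
final_tags = curate(
    suggested,
    accepted={"category:politics", "place:London"},
    added=["person:minister"],
)
print(suggested)
print(final_tags)
```

The automatic step over-generates (the sport tag is triggered by a passing mention), and the curator step removes the misleading tag and adds a missing one. This is the division of work the section describes.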

Fig. 6.3 Press Association content and metadata pattern workflow 


6.7.2.2 The New York Times 


The New York Times (NYT) is the largest metropolitan and the third largest 
newspaper in the United States. The company has a long history of the curation 
of its articles in its 100-year-old curated repository (NYT Index). 

The New York Times’ curation pipeline (see Fig. 6.4) starts with an article getting 
out of the newsroom. The first level curation consists of the content classification 
process done by the editorial staff, which consists of several hundred journalists. Using 
a web application, a member of the editorial staff submits the new article through a 
rule-based information extraction system (in this case, SAS Teragram¹⁴). Teragram 
uses a set of linguistic extraction rules, which are created by the taxonomy managers 
based on a subset of the controlled vocabulary used by the Index Department. 
Teragram suggests tags based on the index vocabulary that can potentially describe 
the content of the article (Curry et al. 2010). The member of the editorial staff then 
selects the terms that better describe the contents and inserts new tags if necessary. 

Taxonomy managers review the classification and the content is published 
online, providing continuous feedback into the classification process. In a later 
stage, the article receives a second level curation by the index department, which 
appends additional tags and a summary of the article to the stored resource. 


Fig. 6.4 The New York Times article classification curation workflow 


14 SAS Teragram http://www.teragram.com 


6.7.3 Retail 
6.7.3.1 eBay 


eBay is one of the most popular online marketplaces, catering for millions of 
products and customers. eBay has employed human computation to solve two 
important issues of data quality: managing product taxonomies and finding 
identifiers in product descriptions. Crowdsourced workers help eBay to improve the 
speed and quality of product classification algorithms at lower cost. 


6.7.3.2 Unilever 


Unilever is one of the world’s largest manufacturers of consumer goods, with global 
operations. Unilever utilized crowdsourced human computation within its marketing 
strategy for new products. Human computation was used to gather sufficient 
data about customer feedback and to analyse public sentiment on social media. 
Initially Unilever developed a set of machine-learning algorithms to conduct a 
sentiment analysis of customers across its product range. However, these sentiment 
analysis algorithms were unable to account for regional and cultural differences 
between target populations. Unilever therefore improved the 
accuracy of its sentiment analysis algorithms with crowdsourcing, by verifying the 
algorithms' output and gathering feedback from an online crowdsourcing platform, 
i.e. CrowdFlower. 
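
One simple way to verify algorithm output with a crowd is majority voting over several workers' labels. The sketch below is an assumption about how such verification might be organized, not a description of Unilever's or CrowdFlower's actual process; the function names and labels are hypothetical.

```python
from collections import Counter


def aggregate_votes(votes):
    """Majority label among crowd votes; None on a tie or when there are no votes."""
    counts = Counter(votes).most_common(2)
    if not counts:
        return None
    if len(counts) == 2 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def verify(algorithm_label, crowd_votes):
    """Return (final label, whether the crowd overrode the algorithm)."""
    crowd_label = aggregate_votes(crowd_votes)
    if crowd_label is None:
        return algorithm_label, False  # no clear crowd verdict: keep the algorithm's label
    return crowd_label, crowd_label != algorithm_label


label, corrected = verify("negative", ["positive", "positive", "negative"])
tie_label, tie_corrected = verify("neutral", ["positive", "negative"])
print(label, corrected)
print(tie_label, tie_corrected)
```

Items on which the crowd overrides the algorithm are natural candidates for retraining data, for example to capture the regional and cultural differences mentioned above.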


6.8 Conclusions 


With the growth in the number of data sources and of decentralized content 
generation, ensuring data quality becomes a fundamental issue for data management 
environments in the big data era. The evolution of data curation methods and 
tools is a cornerstone for ensuring data quality at the scale of big data. 
Based on the evidence collected by an extensive investigation that included a 
comprehensive literature analysis, survey, interviews with data curation experts, 
questionnaires, and case studies, the future requirements and emerging trends for 
data curation were identified. The analysis can provide data curators, technical 
managers, and researchers with an up-to-date view of the challenges, approaches, and 
opportunities for data curation in the big data era. 


Open Access This chapter is distributed under the terms of the Creative Commons Attribution- 
Noncommercial 2.5 License (http://creativecommons.org/licenses/by-nc/2.5/) which permits any 
noncommercial use, distribution, and reproduction in any medium, provided the original author(s) 
and source are credited. 


Chapter 7 
Big Data Storage 


Martin Strohbach, Jörg Daubert, Herman Ravkin, and Mario Lischka 


7.1 Introduction 


This chapter provides an overview of big data storage technologies, which served as 
an input towards the creation of a cross-sectorial roadmap for the development of 
big data technologies in a range of high-impact application domains. Rather than 
elaborating on concrete individual technologies, this chapter provides a broad 
overview of data storage technologies so that the reader may get a high-level 
understanding of the capabilities of individual technologies and of the areas that 
require further research. Consequently, the social and economic impacts are 
described, and selected case studies illustrating the use of big data storage 
technologies are provided. The full results of the analysis on big data storage can be found 
in Curry et al. (2014). 

The position of big data storage within the overall big data value chain can be 
seen in Fig. 7.1. Big data storage is concerned with storing and managing data in a 
scalable way, satisfying the needs of applications that require access to the data. 
The ideal big data storage system would allow storage of a virtually unlimited 
amount of data, cope both with high rates of random write and read access, flexibly 
and efficiently deal with a range of different data models, support both structured 
and unstructured data, and, for privacy reasons, only work on encrypted data. 
Obviously, all these needs cannot be fully satisfied. But over recent years many new 
storage systems have emerged that at least partly address these challenges.¹ 

Fig. 7.1 Data storage in the big data value chain 

This chapter provides an overview of big data storage technologies and identifies 
some areas where further research is required. Here, big data storage technologies 
are understood as storage technologies that in some way specifically address the 
volume, velocity, or variety challenge and do not fall in the category of relational 
database systems. This does not mean that relational database systems do not 
address these challenges, but alternative storage technologies such as columnar 
stores and clever combinations of different storage systems, e.g. using the Hadoop 
Distributed File System (HDFS), are often more efficient and less expensive (Marz 
and Warren 2014). 

Big data storage systems typically address the volume challenge by making use 
of distributed, shared-nothing architectures. Increased storage requirements are 
met by scaling out to new nodes that provide computational power and 
storage. New machines can seamlessly be added to a storage cluster and the storage 
system takes care of distributing the data between individual nodes transparently. 
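
The chapter does not name a specific distribution scheme. One widely used technique for spreading keys across nodes so that adding a machine relocates only a fraction of the data is consistent hashing, sketched below. The class name `HashRing`, the node names, and the number of virtual replicas are assumptions made for this example.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring: each node owns the keys that hash just before its points."""

    def __init__(self, nodes=(), replicas=64):
        self.replicas = replicas  # virtual points per node, to even out the load
        self._ring = []           # sorted list of (hash value, node name)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def node_for(self, key):
        # First ring point at or after the key's hash, wrapping around at the end.
        idx = bisect.bisect(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]


keys = [f"record-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # scale out by one machine
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys moved")
```

Only keys that now belong to the new node are relocated. Keys assigned to the existing nodes stay where they are, which is what makes scaling out cheap compared with rehashing everything.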

Storage solutions also need to cope with the velocity and variety of data. 
Velocity is important in the sense of query latencies, i.e. how long does it take to 
get a reply for a query? This is particularly important in the face of high rates of 
incoming data. For instance, random write access to a database can slow down 
query performance considerably if it needs to provide transactional guarantees. In 
contrast, variety relates to the level of effort that is required to integrate and work 
with data that originates from a large number of different sources. For instance, 
graph databases are suitable storage systems to address these challenges. 


1 See for instance the map of 451 Research available at 
https://451research.com/state-of-the-database-landscape 


Section 7.2 summarizes key insights and Sect. 7.3 illustrates the social and 
economic impact of data storage. Section 7.4 presents the current state-of-the-art 
including storage technologies and solutions for security and privacy. Section 7.5 
includes future requirements and emerging trends for data storage that will play an 
important role for unlocking the value hidden in large datasets. Section 7.6 presents 
three selected case studies, and the chapter is concluded in Sect. 7.7. 


7.2 Key Insights for Big Data Storage 


As a result of the analysis of current and future data storage technologies, a number 
of insights were gained. It became apparent 
that big data storage has become a commodity business and that scalable storage 
technologies have reached an enterprise-grade level that can manage virtually 
unbounded volumes of data. Evidence is provided by the widespread use of 
Hadoop-based solutions offered by vendors such as Cloudera (2014a), 
Hortonworks (2014), and MapR (2014) as well as various NoSQL² database 
vendors, in particular those that use in-memory and columnar storage technologies. 
Compared to traditional relational database management systems that rely on 
row-based storage and expensive caching strategies, these novel big data storage 
technologies offer better scalability at lower operational complexity and costs. 
Despite these advances that improve the performance, scalability, and usability 
of storage technologies, there is still significant untapped potential for big data 
storage technologies, both for using and further developing the technologies: 


• Potential to Transform Society and Businesses across Sectors: Big data 
storage technologies are a key enabler for advanced analytics that have the 
potential to transform society and the way key business decisions are made. 
This is of particular importance in traditionally non-IT-based sectors such as 
energy. While these sectors face non-technical issues such as the lack of skilled 
big data experts and regulatory barriers, novel data storage technologies have the 
potential to enable new value-generating analytics in and across various 
industrial sectors. 

• Lack of Standards Is a Major Barrier: The history of NoSQL is based on 
solving specific technological challenges, which led to a range of different 
storage technologies. The large range of choices coupled with the lack of 
standards for querying the data makes it harder to exchange data stores, as it 
may tie application-specific code to a certain storage solution. 

• Open Scalability Challenges in Graph-Based Data Stores: Processing data 
based on graph data structures is beneficial in an increasing number of 
applications. It allows better capture of semantics and complex relationships with other 
pieces of information coming from a large variety of different data sources, and 
has the potential to improve the overall value that can be generated by analysing 
the data. While graph databases are increasingly used for this purpose, it remains 
hard to efficiently distribute graph-based data structures across computing nodes. 

• Privacy and Security Is Lagging Behind: Although there are several projects 
and solutions that address privacy and security, the protection of individuals and 
securing their data lags behind the technological advances of data storage 
systems. Considerable research is required to better understand how data can 
be misused, how it needs to be protected, and how such protection can be 
integrated in big data storage solutions. 


2 NoSQL is typically referred to as “Not only SQL”. 


7.3 Social and Economic Impact of Big Data Storage 


As emerging big data technologies and their use in different sectors show, the 
capability to store, manage, and analyse large amounts of heterogeneous data hints 
towards the emergence of a data-driven society and economy with huge transfor- 
mational potential (Manyika et al. 2011). Enterprises can now store and analyse 
more data at a lower cost while at the same time enhancing their analytical 
capabilities. While companies such as Google, Twitter, and Facebook are 
established players for which data constitutes the key asset, other sectors also 
tend to become more data driven. For instance, the health sector is an excellent 
example that illustrates how society can expect better health services by better 
integration and analysis of health-related data (iQuartic 2014). 

Many other sectors are heavily impacted by the maturity and cost-effectiveness 
of technologies that are able to handle big datasets. For instance, in the media sector 
the analysis of social media has the potential to transform journalism by 
summarizing news created by a large number of individuals. In the transport sector, the 
consolidated data management integration of transport systems has the potential to 
enable personalized multimodal transportation, improving the experience of 
travellers within a city and at the same time helping decision-makers to better manage 
urban traffic. In all of these areas, NoSQL storage technologies prove a key enabler 
to efficiently analyse large amounts of data and create additional business value. 

On a cross-sectorial level, the move towards a data-driven economy can be seen 
by the emergence of data platforms such as datamarket.com (Gislason 2013), 
infochimp.com, and open data initiatives of the European Union such as open- 
data.europa.eu and other national portals (e.g. data.gov, data.gov.uk, data.gov.sg) 
(Ahmadi Zeleti et al. 2014). Technology vendors are supporting the move towards a 
data-driven economy as can be seen by the positioning of their products and 
services. For instance, Cloudera is offering a product called the enterprise data 
hub (Cloudera 2014b), an extended Hadoop ecosystem that is positioned as a data 
management and analysis integration point for the whole company. 

In addition to the benefits described above, there are also risks associated with big 
data storage technologies that must be addressed to avoid any negative impact. These 
relate, for instance, to the challenge of protecting the data of individuals and reducing 
the energy consumption of data centres (Koomey 2008). 


7.4 Big Data Storage State-of-the-Art 


This section provides an overview of the current state-of-the-art in big data storage 
technologies. Section 7.4.1 describes the storage technologies, and Sect. 7.4.2 pre- 
sents technologies related to secure and privacy-preserving data storage. 


7.4.1 Data Storage Technologies 


During the last decade, the need to deal with the data explosion (Turner et al. 2014) 
and the hardware shift from scale-up to scale-out approaches led to an explosion of 
new big data storage systems that shifted away from traditional relational database 
models. These approaches typically sacrifice properties such as data consistency in 
order to maintain fast query responses with increasing amounts of data. Big data 
stores are used in similar ways as traditional relational database management 
systems, e.g. for online transactional processing (OLTP) solutions and data ware- 
houses over structured or semi-structured data. Particular strengths are in handling 
unstructured and semi-structured data at large scale. 

This section assesses the current state-of-the-art in data store technologies that 
are capable of handling large amounts of data, and identifies data store related 
trends. The following types of storage systems are distinguished: 


• Distributed File Systems: File systems such as the Hadoop Distributed File 
System (HDFS) (Shvachko et al. 2010) offer the capability to store large amounts 
of unstructured data in a reliable way on commodity hardware. Although there 
are file systems with better performance, HDFS is an integral part of the Hadoop 
framework (White 2012) and has already reached the level of a de-facto standard. It has 
been designed for large data files and is well suited for quickly ingesting data and 
bulk processing. 

• NoSQL Databases: Probably the most important family of big data storage 
technologies are NoSQL database management systems. NoSQL databases use 
data models from outside the relational world that do not necessarily adhere to 
the transactional properties of atomicity, consistency, isolation, and durability 
(ACID). 

• NewSQL Databases: A modern form of relational databases that aim for 
scalability comparable to NoSQL databases while maintaining the transactional 
guarantees made by traditional database systems. 

• Big Data Querying Platforms: Technologies that provide query facades in 
front of big data stores such as distributed file systems or NoSQL databases. The 
main concern is providing a high-level interface, e.g. via SQL³-like query 
languages, and achieving low query latencies. 


3 Here and throughout this chapter SQL refers to the Structured Query Language as defined in the 
ISO/IEC Standard 9075-1:2011. 


7.4.1.1 NoSQL Databases 


NoSQL databases are designed for scalability, often by sacrificing consistency. 
Compared to relational databases, they often use low-level, non-standardized query 
interfaces, which make them more difficult to integrate in existing applications that 
expect an SQL interface. The lack of standard interfaces makes it harder to switch 
vendors. NoSQL databases can be distinguished by the data models they use. 


• Key-Value Stores: Key-value stores allow storage of data in a schema-less way. 
Data objects can be completely unstructured or structured and are accessed by a 
single key. As no schema is used, it is not even necessary that data objects share 
the same structure. 

• Columnar Stores: According to Wikipedia “A column-oriented DBMS is a 
database management system (DBMS) that stores data tables as sections of 
columns of data rather than as rows of data, like most relational DBMSs” 
(Wikipedia 2013). Such databases are typically sparse, distributed, and persistent 
multi-dimensional sorted maps in which data is indexed by a triple of a row 
key, column key, and a timestamp. The value is represented as an uninterpreted 
string data type. Data is accessed by column families, i.e. a set of related column 
keys that effectively compress the sparse data in the columns. Column families 
are created before data can be stored and their number is expected to be small. In 
contrast, the number of columns is unlimited. In principle, columnar stores are 
less suitable when all columns need to be accessed. However, in practice this is 
rarely the case, leading to superior performance of columnar stores. 

• Document Databases: In contrast to the values in a key-value store, documents 
are structured. However, there is no requirement for a common schema that all 
documents must adhere to, as is the case for records in relational databases. Thus 
document databases are referred to as storing semi-structured data. Similar to 
key-value stores, documents can be queried using a unique key. However, it is 
also possible to access documents by querying their internal structure, such as 
requesting all documents that contain a field with a specified value. The 
capability of the query interface is typically dependent on the encoding format used 
by the databases. Common encodings include XML or JSON. 

• Graph Databases: Graph databases, such as Neo4J (2015), store data in graph 
structures, making them suitable for storing highly associative data such as social 
network graphs. A particular flavour of graph databases are triple stores such as 
AllegroGraph (Franz 2015) and Virtuoso (Erling 2009) that are specifically 
designed to store RDF triples. However, existing triple store technologies are not 
yet suitable for storing truly large datasets efficiently. 
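
The four data models above can be contrasted with a small in-memory sketch. It uses plain Python structures rather than any real database, and all class, function, and field names are hypothetical; it only illustrates how data is addressed in each model.

```python
# Key-value store: opaque values looked up by a single key, no shared schema.
kv_store = {}
kv_store["user:42"] = b"\x00\x01 any bytes"
kv_store["user:43"] = {"name": "Ada"}


# Columnar store: cells indexed by (row key, column key, timestamp).
class ColumnarTable:
    def __init__(self):
        self._cells = {}

    def put(self, row, column, timestamp, value):
        self._cells[(row, column, timestamp)] = value

    def latest(self, row, column):
        """Most recent version of one cell, or None if the cell is empty."""
        versions = [(ts, v) for (r, c, ts), v in self._cells.items() if r == row and c == column]
        return max(versions, key=lambda tv: tv[0])[1] if versions else None

    def scan_family(self, family):
        """All cells of one column family (columns named 'family:qualifier'), sorted by key."""
        prefix = family + ":"
        return sorted((k, v) for k, v in self._cells.items() if k[1].startswith(prefix))


table = ColumnarTable()
table.put("row1", "contact:email", 1, "old@example.org")
table.put("row1", "contact:email", 2, "new@example.org")
table.put("row1", "profile:name", 1, "Ada")

# Document database: structured documents without a common schema,
# retrievable by key or by querying their internal fields.
documents = {
    "doc1": {"type": "article", "author": "Curry", "year": 2010},
    "doc2": {"type": "interview", "person": "Lippell"},
}


def find(docs, field, value):
    return sorted(doc_id for doc_id, doc in docs.items() if doc.get(field) == value)


# Graph database (triple-store flavour): subject, predicate, object statements.
triples = {("alice", "knows", "bob"), ("bob", "knows", "carol")}


def objects(store, subject, predicate):
    return sorted(o for s, p, o in store if s == subject and p == predicate)


print(table.latest("row1", "contact:email"))
print(find(documents, "type", "article"))
print(objects(triples, "alice", "knows"))
```

The sketch shows the progression the list describes: from values the store cannot inspect, through cells grouped into column families, to documents and graphs whose internal structure can be queried.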


While in general NoSQL data stores scale better than relational databases, 
scalability decreases with increased complexity of the data model used by the 
data store. This particularly applies to graph databases that support applications 
that are both write and read intensive. One approach to optimize read access is to 
partition the graph into sub-graphs that are minimally connected to each 
other and to distribute these sub-graphs between computational nodes. However, as 
new edges are added to a graph, the connectivity between sub-graphs may increase 
considerably. This may lead to higher query latencies due to increased network 
traffic and non-local computations. Efficient sharding schemes must therefore 
carefully consider the overhead required for dynamically re-distributing graph data. 
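
The effect described above can be made concrete by counting the edges that cross partition boundaries (the edge cut), a common proxy for inter-node traffic. The toy graph and partition assignment below are assumptions for illustration only.

```python
def edge_cut(edges, partition):
    """Number of edges whose endpoints are stored on different nodes."""
    return sum(1 for u, v in edges if partition[u] != partition[v])


# Two tightly knit sub-graphs, each placed on its own computational node.
partition = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),  # sub-graph on node 0
    ("d", "e"), ("e", "f"), ("d", "f"),  # sub-graph on node 1
    ("c", "d"),                          # the only cross-node edge
]
cut_before = edge_cut(edges, partition)

# New relationships arrive over time and connect the two sub-graphs.
edges += [("a", "e"), ("b", "f"), ("c", "e")]
cut_after = edge_cut(edges, partition)

print(cut_before, cut_after)
```

With an unchanged placement the cut grows from one edge to four, so more traversals require network hops. Re-partitioning could reduce the cut again, at the price of moving data between nodes.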


7.4.1.2 NewSQL Databases 


NewSQL databases are a modern form of relational databases that aim for compa- 
rable scalability with NoSQL databases while maintaining the transactional guar- 
antees made by traditional database systems. According to Venkatesh and Nirmala 
(2012) they have the following characteristics: 


• SQL is the primary mechanism for application interaction

• ACID support for transactions

• A non-locking concurrency control mechanism

• An architecture providing much higher per-node performance

• A scale-out, shared-nothing architecture, capable of running on a large number
of nodes without suffering bottlenecks


The expectation is that NewSQL systems are about 50 times faster than
traditional OLTP RDBMS. For example, VoltDB (2014) scales linearly in the case
of non-complex (single-partition) queries and provides ACID support. It scales
to dozens of nodes, with the data held on each node limited to the size of that
node's main memory.


7.4.1.3 Big Data Query Platforms 


Big data query platforms provide query facades on top of underlying big data stores 
that simplify querying the underlying data stores. They typically offer an SQL-like 
query interface for accessing the data, but differ in their approach and performance. 

Hive (Thusoo et al. 2009) provides an abstraction on top of the Hadoop
Distributed File System (HDFS) that allows structured files to be queried using
an SQL-like query language. Hive executes the queries by translating them into
MapReduce jobs. As a consequence, Hive queries have a high latency even for
small datasets. Benefits of Hive include the SQL-like query interface and the
flexibility to evolve schemas easily. This is possible because the schema is
stored independently from the data and the data is only validated at query time.
This approach is referred to as schema-on-read, in contrast to the
schema-on-write approach of SQL databases. Changing the schema is therefore a
comparatively cheap operation. The Hadoop columnar store HBase is also supported
by Hive.
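
The schema-on-read idea can be sketched in a few lines. The example below is not
Hive code; it is a minimal illustration in plain Python in which the record
layout, the field names, and the `read_with_schema` helper are all assumptions.

```python
import json
from typing import Any, Callable, Dict, Iterator, List

# Raw records are stored as-is; nothing is validated at write time.
raw_lines: List[str] = [
    '{"id": 1, "temp": "21.5"}',
    '{"id": 2, "temp": "19.0", "unit": "C"}',
    '{"id": 3}',
]

# A schema maps a field name to a function that converts the raw value.
Schema = Dict[str, Callable[[Any], Any]]


def read_with_schema(lines: List[str], schema: Schema) -> Iterator[Dict[str, Any]]:
    """Apply the schema while reading; fields missing in a record become None."""
    for line in lines:
        record = json.loads(line)
        yield {
            name: (cast(record[name]) if name in record else None)
            for name, cast in schema.items()
        }


schema_v1: Schema = {"id": int, "temp": float}
# Evolving the schema only changes how the data is read, not the stored data.
schema_v2: Schema = {"id": int, "temp": float, "unit": str}

rows_v1 = list(read_with_schema(raw_lines, schema_v1))
rows_v2 = list(read_with_schema(raw_lines, schema_v2))
```

Because the stored lines are never rewritten, switching from `schema_v1` to
`schema_v2` costs nothing at write time; the price is paid on every read, when
each record has to be parsed and converted again.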

In contrast to Hive, Impala (Russel 2013) is designed for executing queries with 
low latencies. It re-uses the same metadata and SQL-like user interface as Hive but 
uses its own distributed query engine that can achieve lower latencies. It also 
supports HDFS and HBase as underlying data stores. 

Spark SQL (Shenker et al. 2013) is another low latency query façade that 
supports the Hive interface. The project claims that “it can execute Hive QL queries 
up to 100 times faster than Hive without any modification to the existing data or 
queries” (Shenker et al. 2013). This is achieved by executing the queries using the 
Spark framework (Zaharia et al. 2010) rather than Hadoop’s MapReduce 
framework. 

Finally, Drill is an open source implementation of Google’s Dremel (Melnik
et al. 2002) that, similar to Impala, is designed as a scalable, interactive
ad-hoc query system for nested data. Drill provides its own SQL-like query
language, DrQL, which is compatible with Dremel, but it is also designed to
support other query languages such as the Mongo Query Language. In contrast to
Hive and Impala, it supports a range of schema-less data sources, such as HDFS,
HBase, Cassandra, MongoDB, and SQL databases.


7.4.1.4 Cloud Storage 


As cloud computing grows in popularity, its influence on big data grows as well.
While Amazon, Microsoft, and Google build on their own cloud platforms, other
companies, including IBM, HP, Dell, Cisco, and Rackspace, build their offerings
around OpenStack, an open source platform for building cloud systems (OpenStack
2014).

According to IDC (Grady 2013), by 2020 40 % of the digital universe “will be 
‘touched’ by cloud computing”, and “perhaps as much as 15 % will be maintained 
in a cloud”. 

Cloud computing in general, and cloud storage in particular, can be used by both
enterprises and end users. For end users, storing their data in the cloud
enables reliable access from everywhere and from every device. In addition, end
users can use cloud storage as a simple solution for online backup of their
desktop data. Similarly, for enterprises, cloud storage provides flexible access
from multiple locations and quick and easy scaling of capacity (Grady 2013), as
well as cheaper storage prices and better support based on economies of scale
(CloudDrive 2013). Cost effectiveness is especially high in environments where
enterprise storage needs rise and fall over time.

Technically, cloud storage solutions can be divided into object storage and
block storage. Object storage “is a generic term that describes an approach to
addressing and manipulating discrete units of storage called objects” (Margaret
Rouse 2014a). In block storage, by contrast, data is stored in volumes, also
referred to as blocks. According to Margaret Rouse (2014b), “each block acts as
an individual hard drive” and enables random access to bits and pieces of data,
thus working well with applications such as databases.
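
The difference in access patterns can be made concrete with a small in-memory
sketch. The classes `ObjectStore` and `BlockVolume` are hypothetical and do not
correspond to the API of any cloud provider; the sketch also assumes one common
reading of block storage, namely a volume addressed as fixed-size blocks.

```python
from typing import Dict


class ObjectStore:
    """Objects are addressed by key and read or written as a whole."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


class BlockVolume:
    """A volume is a flat array of fixed-size blocks with random access."""

    def __init__(self, num_blocks: int, block_size: int = 512) -> None:
        self.num_blocks = num_blocks
        self.block_size = block_size
        self._data = bytearray(num_blocks * block_size)

    def _offset(self, index: int) -> int:
        if not 0 <= index < self.num_blocks:
            raise IndexError("block index out of range")
        return index * self.block_size

    def write_block(self, index: int, data: bytes) -> None:
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        start = self._offset(index)
        self._data[start:start + self.block_size] = data

    def read_block(self, index: int) -> bytes:
        start = self._offset(index)
        return bytes(self._data[start:start + self.block_size])


store = ObjectStore()
store.put("reports/2014.csv", b"a,b\n1,2\n")  # whole object in one call

volume = BlockVolume(num_blocks=4, block_size=8)
volume.write_block(2, b"ABCDEFGH")  # update one block in place
```

Updating a small part of a large object means rewriting the whole object in the
object store, whereas the block volume can overwrite a single block, which is
why databases tend to favour block storage.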

In addition to object and block storage, major platforms provide support for 
relational and non-relational database-based storage as well as in-memory storage 
and queue storage. In cloud storage, there are significant differences that need to be 
taken into account in the application-planning phase: 


• As cloud storage is a service, applications using this storage have less
control and may experience decreased performance as a result of networking.
These performance differences need to be taken into account during the design
and implementation stages.

• Security is one of the main concerns related to public clouds. As a result,
the Amazon CTO predicts that in five years all data in the cloud will be
encrypted by default (Vogels 2013).

• Feature-rich clouds like AWS support calibration of latency, redundancy, and
throughput levels for data access, thus allowing users to find the right
trade-off between cost and quality.


Another important issue when considering cloud storage is the supported con- 
sistency model (and associated scalability, availability, partition tolerance, and 
latency). While Amazon’s Simple Storage Service (S3) supports eventual consis- 
tency, Microsoft Azure blob storage supports strong consistency and at the same 
time high availability and partition tolerance. Microsoft uses two layers: (1) a 
stream layer “which provides high availability in the face of network partitioning 
and other failures”, and (2) a partition layer which “provides strong consistency 
guarantees” (Calder et al. 2011). 


7.4.2 Privacy and Security 


Privacy and security are well-recognized challenges in big data. The CSA Big Data 
Working Group published a list of Top 10 Big Data Security and Privacy Chal- 
lenges (Mora et al. 2012). The following are five of those challenges that are vitally 
important for big data storage. 


7.4.2.1 Security Best Practices for Non-relational Data Stores 


The security threats for NoSQL databases are similar to those for traditional
RDBMS, and therefore the same best practices should be applied (Winder 2012).
However, many security measures that are implemented by default within
traditional RDBMS are missing in NoSQL databases (Okman et al. 2011). Such
measures include encryption of sensitive data, sandboxing of processes, input
validation, and strong user authentication.

Some NoSQL suppliers recommend the use of databases in a trusted environ- 
ment with no additional security or authentication measures in place. However, this 
approach is hardly reasonable when moving big data storage to the cloud. 

The security of NoSQL databases is receiving more attention from security
researchers and hackers, and security will further improve as the market
matures. For example, there are initiatives to provide access control
capabilities for NoSQL databases based on Kerberos authentication modules
(Winder 2012).


7.4.2.2 Secure Data Storage and Transaction Logs 


Particular security challenges for data storage arise due to the distribution of
data. With auto-tiering, operators hand over control of data storage to
algorithms in order to reduce costs. Data whereabouts, tier movements, and
changes have to be accounted for in transaction logs.

Auto-tiering strategies have to be carefully designed to prevent sensitive data 
being moved to less secure and thus cheaper tiers; monitoring and logging mech- 
anisms should be in place in order to have a clear view on data storage and data 
movement in auto-tiering solutions (Mora et al. 2012). 
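
A minimal sketch of such a policy is given below. The tier names, the access
thresholds, and the classes `Item` and `TieringEngine` are hypothetical and only
illustrate the two requirements named above: sensitive data stays on the secure
tier, and every tier movement is recorded in a log.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Item:
    name: str
    sensitive: bool
    accesses_last_30d: int
    tier: str = "secure-ssd"  # assumed to be the most secure, most expensive tier


@dataclass
class TieringEngine:
    # Each log entry records (item name, source tier, target tier).
    log: List[Tuple[str, str, str]] = field(default_factory=list)

    def target_tier(self, item: Item) -> str:
        if item.sensitive:
            return "secure-ssd"  # policy: sensitive data never leaves this tier
        if item.accesses_last_30d >= 100:
            return "secure-ssd"
        if item.accesses_last_30d >= 10:
            return "standard-disk"
        return "cold-archive"

    def rebalance(self, items: List[Item]) -> None:
        for item in items:
            target = self.target_tier(item)
            if target != item.tier:
                self.log.append((item.name, item.tier, target))  # account for the move
                item.tier = target


items = [
    Item("payroll", sensitive=True, accesses_last_30d=0),
    Item("clickstream", sensitive=False, accesses_last_30d=3),
]
engine = TieringEngine()
engine.rebalance(items)
```

In a real deployment the log itself would need to be protected against
tampering and kept consistent across tiers, which this sketch does not attempt.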

Proxy re-encryption schemes (Blaze et al. 2006) can be applied to multi-tier 
storage and data sharing in order to ensure seamless confidentiality and authenticity 
(Shucheng et al. 2010). However, performance has to be improved for big data 
applications. Transaction logs for multi-tier operations systems are still missing. 


7.4.2.3 Cryptographically Enforced Access Control and Secure 
Communication 


Today, data is often stored unencrypted, and access control depends solely on
gate-like enforcement. However, data should only be accessible to authorized
entities, as guaranteed by cryptography, in storage as well as in transmission.
For these purposes, new cryptographic mechanisms are required that provide the
required functionality in an efficient and scalable way.

While cloud storage providers are starting to offer encryption, cryptographic key 
material should be generated and stored at the client and never handed over to the 
cloud provider. Some products add this functionality to the application layer of big 
data storage, e.g., zNcrypt, Protegrity Big Data Protection for Hadoop, and the Intel 
Distribution for Apache Hadoop (now part of Cloudera). 

Attribute-based encryption (Goyal et al. 2006) is a promising technology to 
integrate cryptography with access control for big data storage (Kamara and Lauter 
2010; Lee et al. 2013; Li et al. 2013). 



7.4.2.4 Security and Privacy Challenges for Granular Access Control 


Diversity of data is a major challenge due to equally diverse security requirements, 
e.g., legal restrictions, privacy policies, and other corporate policies. Fine-grained 
access control mechanisms are needed to assure compliance with these requirements. 

Major big data components use Kerberos (Miller et al. 1987) in conjunction with
token-based authentication, and Access Control Lists (ACL) based upon users and
jobs. However, more fine-grained mechanisms, for instance Attribute-Based Access
Control (ABAC) and the eXtensible Access Control Markup Language (XACML), are
required to model the vast diversity of data origins and analytical usages.
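
The idea behind attribute-based access control can be sketched as follows. The
attribute names and the two rules are hypothetical, and the deny-by-default
evaluation is a strong simplification compared with the policy combining
algorithms defined by XACML.

```python
from typing import Any, Callable, Dict, List

Attributes = Dict[str, Any]
Rule = Callable[[Attributes, Attributes, Attributes], bool]


def clinician_reads_own_department(subject: Attributes, resource: Attributes,
                                   context: Attributes) -> bool:
    """Clinicians may read health records of their own department."""
    return (
        subject.get("role") == "clinician"
        and resource.get("type") == "health_record"
        and subject.get("department") == resource.get("department")
        and context.get("action") == "read"
    )


def analyst_reads_anonymized(subject: Attributes, resource: Attributes,
                             context: Attributes) -> bool:
    """Analysts may read only data that is flagged as anonymized."""
    return (
        subject.get("role") == "analyst"
        and resource.get("anonymized") is True
        and context.get("action") == "read"
    )


def permits(rules: List[Rule], subject: Attributes, resource: Attributes,
            context: Attributes) -> bool:
    """Grant access if at least one rule matches; deny by default."""
    return any(rule(subject, resource, context) for rule in rules)


rules: List[Rule] = [clinician_reads_own_department, analyst_reads_anonymized]

record = {"type": "health_record", "department": "cardiology", "anonymized": False}
clinician = {"role": "clinician", "department": "cardiology"}
analyst = {"role": "analyst"}

clinician_may_read = permits(rules, clinician, record, {"action": "read"})
analyst_may_read = permits(rules, analyst, record, {"action": "read"})
```

Unlike an ACL, the decision depends on attributes of the subject, the resource,
and the request context rather than on a list of named users, so new data
sources can be covered by tagging them instead of editing per-user lists.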


7.4.2.5 Data Provenance 


The integrity and history of data objects within value chains are crucial.
Traditional provenance governs mostly ownership and usage. With big data,
however, the complexity of provenance metadata will increase (Glavic 2014).

Initial efforts have been made to integrate provenance into the big data ecosys- 
tem (Ikeda et al. 2011; Sherif et al. 2013); however, secure provenance requires 
guarantees of integrity and confidentiality of provenance data in all forms of big 
data storage and remains an open challenge. Furthermore, the analysis of very large 
provenance graphs is computationally intensive and requires fast algorithms. 
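
One building block for the integrity of provenance data is a hash chain, in
which every record includes the hash of its predecessor. The sketch below is
only an illustration of that technique and is not taken from the cited work;
the record fields and function names are assumptions.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # assumed placeholder hash for the first record


def _digest(body: Dict[str, Any]) -> str:
    """Hash a record body using a canonical JSON encoding."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_record(chain: List[Dict[str, Any]], actor: str, action: str,
                  data_id: str) -> None:
    """Append a provenance record that is linked to the previous one."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    body = {"actor": actor, "action": action, "data_id": data_id,
            "prev_hash": prev_hash}
    chain.append({**body, "hash": _digest(body)})


def verify(chain: List[Dict[str, Any]]) -> bool:
    """Check that no record was altered and that the links are intact."""
    prev_hash = GENESIS
    for record in chain:
        body = {k: v for k, v in record.items() if k != "hash"}
        if record["prev_hash"] != prev_hash or _digest(body) != record["hash"]:
            return False
        prev_hash = record["hash"]
    return True


chain: List[Dict[str, Any]] = []
append_record(chain, "sensor-gateway", "ingest", "dataset-1")
append_record(chain, "etl-job", "transform", "dataset-1")
append_record(chain, "analyst", "read", "dataset-1")
intact = verify(chain)
```

The chain detects later modification of individual records, but it does not
provide confidentiality, and an attacker able to rewrite the entire chain could
recompute all hashes; anchoring or signing the latest hash would be needed to
address that.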


7.4.2.6 Privacy Challenges in Big Data Storage 


Researchers have shown (Acquisti and Gross 2009) that big data analysis of 
publicly available information can be exploited to guess the social security number 
of a person. Some products selectively encrypt data fields to create reversible 
anonymity, depending on the access privileges. 

Anonymizing and de-identifying data may be insufficient, as the huge amount of
data may allow for re-identification. A roundtable discussion (Bollier and
Firestone 2010) advocated transparency on the handling of data and algorithms as
well as a new deal on big data (Wu and Guo 2013) to empower the end user as the
owner of the data. Both options involve not only organizational transparency,
but also technical tooling such as Security & Privacy by Design and the results
of the EEXCESS EU FP7 project (Hasan et al. 2013).
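
Why removing names alone may not prevent re-identification can be illustrated
with a simple k-anonymity check. This measure is not discussed in the sources
cited above and is used here only as an illustration; the records and the choice
of quasi-identifiers are invented.

```python
from collections import Counter
from typing import Dict, List, Sequence


def smallest_group(records: List[Dict[str, str]],
                   quasi_identifiers: Sequence[str]) -> int:
    """Return k, the size of the smallest group of records that share the
    same combination of quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# De-identified records: no names, but several quasi-identifiers remain.
records = [
    {"zip": "10115", "birth_year": "1980", "sex": "F", "diagnosis": "asthma"},
    {"zip": "10115", "birth_year": "1980", "sex": "F", "diagnosis": "flu"},
    {"zip": "10117", "birth_year": "1975", "sex": "M", "diagnosis": "diabetes"},
]

k = smallest_group(records, ["zip", "birth_year", "sex"])
print(k)
```

A result of k = 1 means that at least one person is unique with respect to the
chosen attributes and could be re-identified by anyone who can link these
attributes to another dataset.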
7.5 Future Requirements and Emerging Paradigms for Big
Data Storage


This section provides an overview of future requirements and emerging trends. 


7.5.1 Future Requirements for Big Data Storage 


Three key areas have been identified that can be expected to govern future big
data storage technologies: the standardization of query interfaces, increased
support for data security and the protection of users’ privacy, and the support
of semantic data models.


7.5.1.1 Standardized Query Interfaces 


In the medium to long term, NoSQL databases would greatly benefit from
standardized query interfaces, similar to SQL for relational systems. Currently
no standards exist for the individual NoSQL storage types beyond de facto
standard APIs for graph databases (Blueprints 2014) and the SPARQL data
manipulation language (Aranda et al. 2013) supported by triple store vendors.
Other NoSQL databases usually provide their own declarative language or API, and
standardization for these declarative languages is missing.

While declarative language standardization is still missing for some database
categories (key/value, document, etc.), there are efforts to discuss
standardization needs. For instance, the ISO/IEC JTC Study Group on big data has
recently recommended that existing ISO/IEC standards committees should further
investigate the “definition of standard interfaces to support non-relational
data stores” (Lee et al. 2014).

The definition of standardized interfaces would enable the creation of a data 
virtualization layer that would provide an abstraction of heterogeneous data storage 
systems as they are commonly used in big data use cases. Some requirements of a 
data virtualization layer have been discussed online in an Infoworld blog article 
(Kobielus 2013). 
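
What such a virtualization layer could look like is sketched below. The
interface `DataSource`, the two adapters, and `federated_find` are hypothetical
names chosen for illustration; no existing standard or product is implied.

```python
from typing import Any, Dict, Iterator, List, Protocol, Tuple


class DataSource(Protocol):
    """Common query interface that every storage adapter has to offer."""

    def find(self, field: str, value: Any) -> Iterator[Dict[str, Any]]: ...


class KeyValueSource:
    """Adapter over a key-value store; it has to scan all values."""

    def __init__(self, store: Dict[str, Dict[str, Any]]) -> None:
        self._store = store

    def find(self, field: str, value: Any) -> Iterator[Dict[str, Any]]:
        for key, record in self._store.items():
            if record.get(field) == value:
                yield {"_key": key, **record}


class RowSource:
    """Adapter over relational-style rows with a fixed list of columns."""

    def __init__(self, columns: List[str], rows: List[Tuple[Any, ...]]) -> None:
        self._columns = columns
        self._rows = rows

    def find(self, field: str, value: Any) -> Iterator[Dict[str, Any]]:
        if field not in self._columns:
            return  # unknown column: this source contributes no results
        index = self._columns.index(field)
        for row in self._rows:
            if row[index] == value:
                yield dict(zip(self._columns, row))


def federated_find(sources: List[DataSource], field: str,
                   value: Any) -> List[Dict[str, Any]]:
    """Run the same query against all sources and merge the results."""
    results: List[Dict[str, Any]] = []
    for source in sources:
        results.extend(source.find(field, value))
    return results


kv = KeyValueSource({
    "u1": {"name": "Ada", "city": "Berlin"},
    "u2": {"name": "Bob", "city": "Paris"},
})
table = RowSource(["name", "city"], [("Cy", "Berlin"), ("Di", "Rome")])

berliners = federated_find([kv, table], "city", "Berlin")
```

The application code only depends on `find`, so a further storage system can be
added by writing one more adapter. A standardized interface would remove the
need for each project to define such an abstraction on its own.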


7.5.1.2 Security and Privacy 


Interviews were conducted with consultants and end users of big data storage who 
have responsibility for security and privacy, to gain their personal views and 
insights. Based upon these interviews and the gaps identified in Sect. 7.4.2, several 
future requirements for security and privacy in big data storage were identified. 
Data Commons and Social Norms: Data stored in large quantities will be subject
to sharing as well as derivative work in order to maximize big data benefits.
Today, users are not aware of how big data processes their data (transparency),
and it is not clear how big data users can share and obtain data efficiently.
Furthermore, legal constraints with respect to privacy and copyright in big data
are currently not completely clear within the EU. For instance, big data allows
novel analytics based upon aggregated data from manifold sources. How does this
approach affect private information? How can rules and regulations for remixing
and derivative works be applied to big data? Such uncertainty may put the EU at
a disadvantage compared to the USA.


Data Privacy: Big data storage must comply with EU privacy regulations such as
Directive 95/46/EC when personal information is being stored. Today,
heterogeneous implementations of this directive render the storage of personal
information in big data difficult. The General Data Protection Regulation
(GDPR), first proposed in 2012, is an ongoing effort to harmonize data
protection among EU member states. The GDPR is expected to influence future
requirements for big data storage. As of 2014, the GDPR is subject to
negotiations that make it difficult to estimate the final rules and the start of
enforcement. For instance, the 2013 draft version allows data subjects (persons)
to request data controllers to delete personal data, which is often not
sufficiently considered by big data storage solutions.


Data Tracing and Provenance: Tracing and provenance of data are becoming more
and more important in big data storage for two reasons: (1) users want to
understand where data comes from, whether the data is correct and trustworthy,
and what happens to their results, and (2) big data storage will become subject
to compliance rules as big data enters critical business processes and value
chains. Therefore, big data storage has to maintain provenance metadata, provide
provenance along the data processing chain, and offer user-friendly ways to
understand and trace the usage of data.


Sandboxing and Virtualization: Sandboxing and virtualization of big data
analytics are becoming more important in addition to access control. Owing to
economies of scale, big data analytics benefit from resource sharing. However,
security breaches of shared analytical components can lead to compromised
cryptographic access keys and full storage access. Thus, jobs in big data
analytics must be sandboxed to prevent an escalation of security breaches and
the resulting unauthorized access to storage.


7.5.1.3 Semantic Data Models 


The multitude of heterogeneous data sources increases development costs, as
applications require knowledge about the data format of each individual source.
An emerging trend that tries to address this challenge is the semantic web, and
in particular the semantic sensor web. A multitude of research projects are
concerned with all levels of semantic modelling and computation. As detailed in
this book, the need for semantic annotations has, for instance, been identified
for the health sector. The requirement for data storage is therefore to support
the large-scale storage and management of semantic data models. In particular,
trade-offs between expressivity and efficient storage and querying need to be
further explored.


7.5.2 Emerging Paradigms for Big Data Storage 


There are several new paradigms emerging for the storage of large and complex 
datasets. These new paradigms include, among others, the increased use of NoSQL 
databases, convergence with analytics frameworks, and managing data in a central 
data hub. 


7.5.2.1 Increased Use of NoSQL Databases 


NoSQL databases, most notably graph databases and columnar stores, are
increasingly used as a replacement for, or complement to, existing relational
systems.

For instance, the requirement of using semantic data models and cross linking
data with many different data and information sources strongly drives the need
to be able to store and analyse large amounts of data using graph-based models.
However, this requires overcoming the limitations of current graph-based systems
described above. Jim Webber, for example, states “Graph technologies are going
to be incredibly important” (Webber 2013). In another interview, Ricardo
Baeza-Yates, VP of Research for Europe and Latin America at Yahoo!, also states
the importance of handling large-scale graph data (Baeza-Yates 2013). The
Microsoft research project Trinity achieved a significant breakthrough in this
area. Trinity is an in-memory data storage and distributed processing platform.
By building on its very fast graph traversal capabilities, Microsoft researchers
introduced a new approach to cope with graph queries. Other projects include
Google’s knowledge graph and Facebook’s graph search, which demonstrate the
increasing relevance and growing maturity of graph technologies.


7.5.2.2 In-Memory and Column-Oriented Designs 


Many modern high-performance NoSQL databases are based on columnar designs. 
The main advantage is that in most practical applications only a few columns are 
needed to access the data. Consequently storing data in columns allows faster 
access. In addition, column-oriented databases often do not support the expensive 
join operations from the relational world. Instead, a common approach is to use a 
single wide column table that stores the data based on a fully denormalized schema. 
According to Michael Stonebraker, “SQL vendors will all move to column stores,
because they are wildly faster than row stores” (Stonebraker 2012a).
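
The difference between the two layouts can be shown with a toy example in plain
Python. The table contents and the helper `to_columns` are invented for
illustration, and the sketch ignores compression and disk layout, which are
important in real column stores.

```python
from typing import Any, Dict, List

# Row-oriented layout: one complete record per entry.
rows: List[Dict[str, Any]] = [
    {"id": 1, "customer": "A", "amount": 10.0, "country": "DE"},
    {"id": 2, "customer": "B", "amount": 25.5, "country": "FR"},
    {"id": 3, "customer": "A", "amount": 4.5, "country": "DE"},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Convert records into a column-oriented layout: one list per column."""
    columns: Dict[str, List[Any]] = {name: [] for name in records[0]}
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return columns


columns = to_columns(rows)

# Row layout: every record is visited in full, although only "amount" is needed.
total_row_layout = sum(row["amount"] for row in rows)

# Column layout: only the values of the "amount" column are read.
total_column_layout = sum(columns["amount"])
```

Both computations return the same total, but the column layout touches one of
four columns, which is the access pattern that the argument above relies on.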

High-performance in-memory databases such as SAP HANA typically combine
in-memory techniques with column-based designs. In contrast to relational
systems that cache data in memory, in-memory databases can use techniques such
as anti-caching (DeBrabant et al. 2013). Harizopoulos et al. have shown that
most of the time spent executing a query goes into administrative tasks such as
buffer management and locking (Harizopoulos et al. 2008).


7.5.2.3 Convergence with Analytics Frameworks 


During the course of the project, many scenarios have been identified that call
for better analysis of available data to improve operations in various sectors.
Technically, this means an increased need for complex analytics that go beyond
simple aggregations and statistics. Stonebraker points out that the need for
complex analytics will strongly impact existing data storage solutions
(Stonebraker 2012b).

As use case-specific analytics are among the most crucial components for
creating actual business value, it becomes increasingly important not only to
scale up these analytics to satisfy performance requirements, but also to reduce
the overall development complexity and cost. Figure 7.2 shows some differences
between using separate systems for data management and analytics versus
integrated analytical databases.


7.5.2.4 The Data Hub 


A central data hub that integrates all data in an enterprise is a paradigm that 
considers managing all company data as a whole, rather than in different, isolated 
databases managed by different organizational units. The benefit of a central data 
hub is that data can be analysed as a whole, linking various datasets owned by the 
company thus leading to deeper insights. 

Typical technical implementations are based on a Hadoop-based system that may
use HDFS or HBase (Apache 2014) to store an integrated master dataset. On the
one hand, this master dataset can be used as ground truth and backup for
existing data management systems; on the other hand, it provides the basis for
advanced analytics that combine previously isolated datasets.

Companies such as Cloudera use this paradigm to market their Hadoop distribution
(Cloudera 2014b). Many use cases of enterprise data hubs already exist. A case
study in the financial sector is described in the next section.

Fig. 7.2 Paradigm shift from pure data storage systems to integrated analytical databases


7.6 Sector Case Studies for Big Data Storage 


In this section, three selected use cases are described that illustrate the
potential of, and need for, future storage technologies. The first use case,
from the health sector, illustrates how social media-based analytics is enabled
by NoSQL storage technologies. The second use case, from the financial sector,
illustrates the emerging paradigm of a centralized data hub. The last use case,
from the energy sector, illustrates the benefits of managing fine-grained
Internet of Things (IoT) data for advanced analytics. An overview of the key
characteristics of the use cases can be found in Table 7.1. More case studies
are presented in Curry et al. (2014).

Table 7.1 Key characteristics of selected big data storage case studies

Case study | Sector | Volume | Storage technologies | Key requirements
Treato: Social media based medication intelligence | Health | >150 TB | HBase | Cost-efficiency, scalability limitations of relational DBs
Centralized data hub | Finance | Between several petabytes and over 150 PB | Hadoop/HDFS | Building more accurate models, scale of data, suitability for unstructured data
Smart grid | Energy | Tens of TB per day | Hadoop | Data volume, operational challenges


7.6.1 Health Sector: Social Media-Based Medication 
Intelligence 


Treato is an Israeli company that specializes in mining user-generated content from 
blogs and forums in order to provide brand intelligence services to pharmaceutical 
companies. As Treato is analysing the social web, it falls into the “classical” 
category of analysing large amounts of unstructured data, an application area that 
often asks for big data storage solutions. Treato’s service as a use case demonstrates 
the value of using big data storage technologies. The information is based on a case 
study published by Cloudera (2012), the company that provided the Hadoop 
distribution Treato has been using. 

While building its prototype, Treato discovered “that side effects could be 
identified through social media long before pharmaceutical companies or the 
Food & Drug Administration (FDA) issued warnings about them. For example, 
when looking at discussions about Singulair, an asthma medication, Treato found 
that almost half of UGC discussed mental disorders; the side effect would have 
been identifiable four years before the official warning came out.” (Cloudera 2012). 

Treato initially faced two major challenges. First, it needed to develop the
analytical capabilities to analyse patients’ colloquial language and map it
to a medical terminology suitable for delivering insights to its customers.
Second, it was necessary to analyse a large number of data sources as fast as
possible in order to provide accurate information in real time.

The first challenge, developing the analytics, was initially addressed with a
non-Hadoop system based on a relational database. With that system, Treato faced
the limitation that it could only handle “data collection from dozens of
websites and could only process a couple of million posts per day” (Cloudera
2012). Thus, Treato was looking for a cost-efficient analytics platform that
could fulfil the following key requirements:


1. Reliable and scalable storage 

2. Reliable and scalable processing infrastructure 

3. Search engine capabilities for retrieving posts with high availability 
4. Scalable real-time store for retrieving statistics with high availability 

As a result, Treato decided on a Hadoop-based system that uses HBase to store
the list of URLs to be fetched. The posts available at these URLs are analysed
by using natural language processing in conjunction with their proprietary
ontology. In addition, “each individual post is indexed, statistics are
calculated, and HBase tables are updated” (Cloudera 2012).

According to the case study report, the Hadoop-based solution stores more than
150 TB of data, including 1.1 billion online posts from thousands of websites
covering more than 11,000 medications and more than 13,000 conditions. Treato is
able to process 150 to 200 million user posts per day.

For Treato, the impact of the Hadoop-based storage and processing infrastruc- 
ture is that they obtain a scalable, reliable, and cost-effective system that may even 
create insights that would not have been possible without this infrastructure. The 
case study claims that with Hadoop, Treato improved execution time at least by a 
factor of six. This allowed Treato to respond to a customer request about a new 
medication within one day. 


7.6.2 Finance Sector: Centralized Data Hub 


As mapped out in the description of the sectorial roadmaps (Lobillo et al. 2013), the 
financial sector is facing challenges with respect to increasing data volumes and a 
variety of new data sources such as social media. Here use cases are described for 
the financial sector based on a Cloudera solution brief (Cloudera 2013). 

Financial products, including online banking and trading, are increasingly
digitalized. As online and mobile access simplifies access to financial
products, there is an increased level of activity leading to even more data. The
potential of big data in this scenario is to use all available data for building
accurate models that can help the financial sector to better manage financial
risks. According to the solution brief, companies have access to several
petabytes of data. According to Larry Feinsmith, managing director of JPMorgan
Chase, his company is storing over 150 petabytes online and uses Hadoop for
fraud detection (Cloudera 2013).

Secondly, new data sources add to both the volume and variety of available data.
In particular, unstructured data from weblogs, social media, blogs, and other
news feeds can help in customer relationship management, risk management, and
maybe even algorithmic trading (Lobillo et al. 2013). Pulling all the data
together in a centralized data hub enables more detailed analytics that can
provide a competitive edge. However, traditional systems cannot keep up with the
scale, costs, and cumbersome integration of traditional extract, transform, load
(ETL) processes using fixed data schemes, nor are they able to handle
unstructured data. Big data storage systems, by contrast, scale extremely well
and can process both structured and unstructured data.
7.6.3 Energy: Device Level Metering


In the energy sector, smart grid and smart meter management is an area that
promises both high economic and environmental benefits. As depicted in Fig. 7.3,
the introduction of renewable energies such as photovoltaic systems deployed on
houses can cause grid instabilities. Currently, grid operators have little
knowledge about the last mile to energy consumers. Thus they are not able to
react appropriately to instabilities caused at the very edges of the grid
network. By analysing smart meter data sampled at one-second intervals,
short-term forecasting of energy demand and managing the demand of devices such
as heating and electric cars become possible, thus stabilizing the grid. If
deployed in millions of households, the data volumes can reach petabyte scale,
and thus benefit greatly from new storage technologies. Table 7.2 shows the data
volume of the raw data alone collected in one day.

The Peer Energy Cloud (PEC) project (2014) is a publicly funded project that has
demonstrated how smart meter data can be analysed and used for trading energy in
the local neighbourhood, thus increasing the overall stability of the power
grid. Moreover, it has successfully shown that by collecting more fine-grained
data, i.e. monitoring the energy consumption of individual devices in the
household, the accuracy of predicting the energy consumption of households can
be significantly improved (Ziekow et al. 2013). As the data volumes increase, it
becomes increasingly difficult to handle the data with legacy relational
databases (Strohbach et al. 2011).

Fig. 7.3 Introduction of renewable energy at consumer sites changes the topology
of the energy grid and requires new measurement points at the leaves of the grid

Table 7.2 Calculation of the amount of data sampled by smart meters

Sampling rate | 1 Hz
Record size | 50 Bytes
Raw data per day and household | 4.1 MB
Raw data per day for 10 million customers | ~39 TB
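
The figures in Table 7.2 can be reproduced with a short calculation. The sketch
assumes that MB and TB in the table are meant as binary units (2^20 and 2^40
bytes), since this interpretation matches the stated values; the function name
is chosen for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 samples per day at 1 Hz


def raw_bytes_per_day(sampling_rate_hz: float, record_size_bytes: int,
                      households: int = 1) -> float:
    """Raw data volume per day in bytes, without indexes or replication."""
    return sampling_rate_hz * SECONDS_PER_DAY * record_size_bytes * households


per_household = raw_bytes_per_day(1, 50)
per_10m_households = raw_bytes_per_day(1, 50, households=10_000_000)

per_household_mb = per_household / 2**20       # roughly 4.1
per_10m_households_tb = per_10m_households / 2**40  # roughly 39

print(f"{per_household_mb:.1f} MB per household and day")
print(f"{per_10m_households_tb:.0f} TB per day for 10 million households")
```

Storing this volume for a year would amount to roughly 14 PB of raw data, before
any replication, which is consistent with the petabyte scale mentioned above.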


7.7 Conclusions 


The chapter contains an overview of current big data storage technologies as
well as emerging paradigms and future requirements. The overview specifically
included technologies and approaches related to privacy and security. Rather
than focusing on detailed descriptions of individual technologies, a broad
overview was provided, and technical aspects that have an impact on creating
value from large amounts of data were highlighted. The social and economic
impact of big data storage technologies was described, and three selected case
studies in three different sectors were detailed, which illustrate the need for
easy-to-use, scalable technologies.

It can be concluded that there is already a huge offering of big data storage 
technologies. They have reached a maturity level that is high enough that early 
adopters in various sectors already use or plan to use them. Big data storage often 
has the advantage of better scalability at a lower price tag and lower operational 
complexity. The current state of the art reflects that the efficient management of 
almost any size of data is not a challenge per se. Thus big data storage has huge 
potential to transform business and society in many areas. 

It can also be concluded that there is a strong need to increase the maturity of 
storage technologies so that they fulfil future requirements and lead to a wider 
adoption, in particular in non-IT-based companies. The required technical improve- 
ments include the scalability of graph databases that will enable better handling of 
complex relationships, as well as further minimizing query latencies to big datasets, 
e.g. by using in-memory databases. Another major roadblock is the lack of stan- 
dardized interfaces to NoSQL database systems. The lack of standardization 
reduces flexibility and slows down adoption. Finally, considerable improvements 
for security and privacy are required. Secure storage technologies need to be further 
developed to protect the privacy of users. 

More details about big data storage technologies can be found in Curry 
et al. (2014). This report, in conjunction with the analysis of the public and 
10 industrial sectors (Zillner et al. 2014), has been used as a basis to develop the 
cross-sectorial roadmap described in this book. 

Open Access This chapter is distributed under the terms of the Creative Commons 
Attribution-Noncommercial 2.5 License (http://creativecommons.org/licenses/by-nc/2.5/) which permits any 
noncommercial use, distribution, and reproduction in any medium, provided the original author(s) 
and source are credited. 

The images or other third party material in this book are included in the work’s Creative 


Chapter 8 
Big Data Usage 


Tilman Becker 


8.1 Introduction 


One of the core business tasks of advanced data usage is the support of business 
decisions. Data usage is a wide field that is addressed in this chapter by viewing data 
usage from various perspectives, including the underlying technology stacks, trends 
in various sectors, the impact on business models, and requirements on 
human-computer interaction. 

The full life-cycle of information is covered in this book, with previous chapters 
covering data acquisition, storage, analysis, and curation. The position of big data 
usage within the overall big data value chain can be seen in Fig. 8.1. Data usage 
covers the business goals that need access to such data, its analyses, and the tools 
needed to integrate the analyses in business decision-making. 

The process of decision-making includes reporting, exploration of data (brows- 
ing and lookup), and exploratory search (finding correlations, comparisons, what-if 
scenarios, etc.). The business value of such information logistics is twofold: 
(1) control over the value chain and (2) transparency of the value chain. The former 
is generally independent from big data; the latter, however, provides opportunities 
and requirements for data markets and services. 

Big data influences the validity of data-driven decision-making in the future. 
Influencing factors are (1) the time range for decisions/recommendations, from 
short term to long term and (2) the various databases (in a non-technical sense) from 
past, historical data to current and up-to-date data. 

New data-driven applications will strongly influence the development of new 
markets. A potential blocker of such developments is always the need for new 
partner networks (combination of currently separate capabilities), business 
processes, and markets. 

Fig. 8.1 Data usage in the big data value chain 

A special area of use cases for big data is the manufacturing, transportation, and 
logistics sector. These sectors are undergoing a transformational change as part of 
an industry-wide trend, called “Industry 4.0”, which originates in the digitization 
and interlinking of products, production facilities, and transportation infrastructure 
as part of the developing “Internet of Things”. Data usage has a profound impact in 
these sectors, e.g. applications of predictive analysis in maintenance are leading to 
new business models as the manufacturers of machinery are in the best position to 
provide big data-based maintenance. The emergence of cyber-physical systems 
(CPS) for production, transportation, logistics, and other sectors brings new chal- 
lenges for simulation and planning, for monitoring, control, and interaction 
(by experts and non-experts) with machinery or data usage applications. 

On a larger scale, new services and a new service infrastructure are required. 
Under the title “smart data” and smart data services, requirements for data and also 
service markets are formulated. Besides the technology infrastructure for the 
interaction and collaboration of services from multiple sources, there are legal 
and regulatory issues that need to be addressed. A suitable service infrastructure 
is also an opportunity for SMEs to take part in big data usage scenarios by offering 
specific services, e.g., through data usage service marketplaces. 

Access to data usage is given through specific tools and in turn through query 
and scripting languages that typically depend on the underlying data stores, their 
execution engines, APIs, and programming models. In Sect. 8.4.1, different 
technology stacks and some of the trade-offs involved are discussed. Section 8.4.2 
presents general aspects of decision support, followed by a discussion of specific 
access to analysis results through visualization and new explorative interfaces. 
Human-computer interaction will play a growing role in decision support since 
many cases cannot rely on pre-existing models of correlation. In such cases, user 
interfaces (e.g. in data visualization for visual analytics) must support an 
exploration of the data and their potential connections. Emerging trends and future 
requirements are presented in Sect. 8.5 with special emphasis on Industry 4.0 and 
the emerging need for smart data and smart services. 


8.2 Key Insights for Big Data Usage 


The key insights for big data usage identified are as follows: 


Predictive Analytics A prime example for the application of predictive analytics 
is in predictive maintenance based on sensor and context data to predict deviations 
from standard maintenance intervals. Where data points to a stable system, main- 
tenance intervals can be extended, leading to lower maintenance costs. Where data 
points to problems before reaching a scheduled maintenance, savings can be even 
higher if a breakdown, repair cost, and downtimes can be avoided. Information 
sources go beyond sensor data and tend to include environmental and context data, 
including usage information (e.g. high load) of the machinery. As predictive 
analysis depends on new sensors and data processing infrastructure, large manu- 
facturers are switching their business model and investing in new infrastructure 
themselves (realizing scale effects on the way) and leasing machinery to their 
customers. 


Industry 4.0 A growing trend in manufacturing is the employment of cyber- 
physical systems. It brings about an evolution of old manufacturing processes, on 
the one hand making available a massive amount of sensor and other data and on the 
other hand bringing the need to connect all available data through communication 
networks and usage scenarios that reap the potential benefits. Industry 4.0 stands for 
the entry of IT into the manufacturing industry and brings with it a number of 
challenges for IT support. This includes services for diverse tasks such as planning 
and simulation, monitoring and control, interactive use of machinery, logistics and 
enterprise resource planning (ERP), predictive analysis, and eventually prescriptive 
analysis where decision processes can be automatically controlled by data analysis. 


Smart Data and Service Integration When further developing the scenario for 
Industry 4.0 above, services that solve the tasks at hand come into focus. To enable 
the application of smart services to big data usage problems, both technical and 
organizational matters need to be resolved. Data protection and privacy issues, 
regulatory issues, and new legal challenges (e.g. with respect to ownership issues 
for derived data) must all be addressed. 

On a technical level, there are multiple dimensions along which the interaction 
of services must be enabled: on a hardware level from individual machines, to 
facilities, to networks; on a conceptual level from intelligent devices to intelligent 
systems and decisions; on an infrastructure level from IaaS to PaaS and SaaS to new 
services for big data usage and even to business processes and knowledge as a 
service. 


Interactive Exploration When working with large volumes of data in large 
variety, the underlying models for functional relations are oftentimes missing. 
This means data analysts have a greater need for exploring datasets and analyses. 
This is addressed through visual analytics and new and dynamic ways of data 
visualization, but new user interfaces with new capabilities for the exploration of 
data are needed. Integrated data usage environments provide support, e.g., through 
history mechanisms and the ability to compare different analyses, different para- 
meter settings, and competing models. 


8.3 Social and Economic Impact for Big Data Usage 


One of the most important impacts of big data usage scenarios is the discovery of 
new relations and dependencies in the data that lead, on the surface, to economic 
opportunities and more efficiency. On a deeper level, big data usage can provide a 
better understanding of these dependencies, making the system more transparent 
and supporting economic as well as social decision-making processes (Manyika 
et al. 2011). Wherever data is publicly available, social decision-making is 
supported; where relevant data is available on an individual-level, personal 
decision-making is supported. The potential for transparency through big data 
usage comes with a number of requirements: (1) regulations and agreements on 
data access, ownership, protection, and privacy, (2) demands on data quality 
(e.g. on the completeness, accuracy, and timeliness of data), and (3) access to the 
raw data as well as access to appropriate tools or services for big data usage. 

Transparency thus has economic, social, and personal dimensions. Where 
the requirements listed above can be met, decisions become transparent and can be 
made in a more objective, reproducible manner, and the decision processes are 
open to involve further players. 

The current economic drivers of big data usage are large companies with access 
to complete infrastructures. These include sectors like advertising at Internet 
companies and sensor data from large infrastructures (e.g. smart grids or smart 
cities) or for complex machinery (e.g. airplane engines). In the latter examples, 
there is a trend towards even closer integration of data usage at large companies as 
the big data capabilities remain with the manufacturers (and not the customers), 
e.g. when engines are only rented and the big data infrastructure is owned and 
managed by the manufacturers. 

There is a growing requirement for standards and accessible markets for data as 
well as for services to manage, analyse, and exploit further uses of data. Where such 
requirements are met, opportunities are created for SMEs to participate in more 
complex use cases for big data usage. Section 8.5.2.1 discusses these requirements 
for smart data and corresponding smart data services. 


8.4 Big Data Usage State-of-the-Art 


This section provides an overview of the current state of the art in big data usage, 
addressing briefly the main aspects of the technology stacks employed and the 
subfields of decision support, predictive analysis, simulation, exploration, 
visualization, and more technical aspects of data stream processing. Future requirements 
and emerging trends related to big data usage will be addressed in Sect. 8.5. 


8.4.1 Big Data Usage Technology Stacks 


Big data applications rely on the complete data value chain that is covered in the BIG 
project, starting at data acquisition and including curation, storage, and analysis, 
which all come together in data usage. On the technology side, a big data usage 
application relies on a whole stack of technologies that cover the range from data 
stores and their access to processing execution engines that are used by query 
interfaces and languages. 

It should be stressed that the complete big data technology stack can be seen as 
much broader, i.e., encompassing the hardware infrastructure, such as storage 
systems, servers, datacentre networking infrastructure, corresponding data organ- 
ization and management software, as well as a whole range of services ranging from 
consulting and outsourcing to support and training on the business side as well as 
the technology side. 

Actual user access to data usage is given through specific tools and in turn 
through query and scripting languages that typically depend on the underlying data 
stores, their execution engines, APIs, and programming models. Some examples 
include SQL for classical relational database management systems (RDBMS), 
Dremel and Sawzall for Google’s file system (GFS), and MapReduce, Hive, Pig, 
and Jaql for Hadoop-based approaches, Scope for Microsoft’s Dryad and 
CosmosFS, and many other offerings, e.g. Stratosphere’s Meteor/Sopremo and 
ASTERIX’s AQL/Algebricks. 

Analytics tools that are relevant for data usage include SystemT (IBM, for data 
mining and information extraction), R and Matlab (U. Auckland and Mathworks, 
respectively, for mathematical and statistical analysis), tools for business 
intelligence and analytics (SAS Analytics (SAS), Vertica (HP), SPSS (IBM)), tools 
for search and indexing (Lucene and Solr (Apache)), and specific tools for 
visualization (Tableau, Tableau Software). Each of these tools has its specific 
area of application and covers different aspects of big data. 

The tools for big data usage support business activities that can be grouped into 
three categories: lookup, learning, and investigating. The boundaries are sometimes 
fuzzy, and learning and investigating might be grouped as examples of exploratory 
search. Decision support needs access to data in many ways, and as big data more 
often allows the detection of previously unknown correlations, data access must 
more often go through interfaces that enable exploratory search rather than mere 
access to predefined reports. 

1 Stratosphere is further developed in the Apache Flink project. 


8.4.1.1 Trade-Offs in Big Data Usage Technologies 


An in-depth case study analysis of a complete big data application was performed to 
determine the decisions involved in weighing the advantages and disadvantages of 
the various available components of a big data technology stack. Figure 8.2 shows 
the infrastructure used for Google’s YouTube Data Warehouse (YTDW) as detailed 
in Chattopadhyay (2011). Some of the core lessons learned by the YouTube team 
include an acceptable trade-off in functionality when giving priority to low-latency 
queries. This justified the decision to stick with the Dremel tool (for querying 
large datasets), which has acceptable drawbacks in expressive power (when compared 
to SQL-based tools), yet provides low-latency results and scales to what Google 
considers “medium” scales. Note, however, that Google speaks of “trillions of rows 
in seconds”, running on “thousands of CPUs and petabytes of data”, and processing 
“quadrillions of records per month”. While Google regards this as medium scale, 
this might be sufficient for many applications that are clearly in the realms of big 
data. Table 8.1 shows a comparison of various data usage technology components 
used in the YTDW, where latency refers to the time the systems need to answer a 
request; scalability to the ease of using ever larger datasets; SQL to the (often 
preferred) ability to use SQL (or similar) queries; and power to the expressive 
power of search queries. 

Fig. 8.2 The YouTube Data Warehouse (YTDW) infrastructure. Derived from Chattopadhyay 
(2011) 

Table 8.1 Comparison of data usage technologies used in YTDW. Source: Chattopadhyay (2011) 

               Sawzall    Tenzing    Dremel 
Latency        High       Medium     Low 
Scalability    High       High       Medium 
SQL            None       High       Medium 
Power          High       Medium     Low 
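
The trade-offs in Table 8.1 can be read as a simple filtering problem: given the properties an application cares about most, which components remain? The following Python sketch is only an illustration of that reading. The ratings are copied from the table, while the ordinal scale, the dictionary layout, and the function name candidates are assumptions made here.

```python
# Ordinal scale used to compare the qualitative ratings of Table 8.1.
LEVELS = {"None": 0, "Low": 1, "Medium": 2, "High": 3}

# Ratings copied from Table 8.1 (Chattopadhyay 2011).
YTDW_TOOLS = {
    "Sawzall": {"latency": "High", "scalability": "High", "sql": "None", "power": "High"},
    "Tenzing": {"latency": "Medium", "scalability": "High", "sql": "High", "power": "Medium"},
    "Dremel": {"latency": "Low", "scalability": "Medium", "sql": "Medium", "power": "Low"},
}


def candidates(max_latency="High", min_scalability="None", min_sql="None", min_power="None"):
    """Return the tools whose ratings satisfy all given bounds.

    Latency is an upper bound (lower is better); the others are lower bounds.
    """
    result = []
    for name, rating in YTDW_TOOLS.items():
        if (LEVELS[rating["latency"]] <= LEVELS[max_latency]
                and LEVELS[rating["scalability"]] >= LEVELS[min_scalability]
                and LEVELS[rating["sql"]] >= LEVELS[min_sql]
                and LEVELS[rating["power"]] >= LEVELS[min_power]):
            result.append(name)
    return result


print(candidates(max_latency="Low"))   # priority on low-latency queries
print(candidates(min_sql="High"))      # priority on SQL support
print(candidates(min_power="High"))    # priority on expressive power
```

No single component satisfies all three priorities at once, which is the trade-off the case study describes.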


8.4.2 Decision Support 


Current decision support systems—as far as they rely on static reports—use these 
techniques but do not allow sufficient dynamic usage to reap the full potential of 
exploratory search. However, in increasing order of complexity, these groups 
encompass the following business goals: 


Lookup: On the lowest level of complexity, data is merely retrieved for various 
purposes. These include fact retrieval and searches for known items, e.g. for 
verification purposes. Additional functionalities include navigation through 
datasets and transactions. 

Learning: On the next level, these functionalities can support knowledge 
acquisition and interpretation of data, enabling comprehension. Supporting 
functionalities include comparison, aggregation, and integration of data. Addi- 
tional components might support social functions for data exchange. Examples 
for learning include simple searches for a particular item (knowledge acqui- 
sition), e.g. a celebrity and their use in advertising (retail). A big data search 
application would be expected to find all related data and present an 
integrated view. 

Investigation: On the highest level of decision support systems, data can be 
analysed, accreted, and synthesized. This includes tool support for exclusion, 
negation, and evaluation. At this level of analysis, true discoveries are supported 
and the tools influence planning and forecasting. Higher levels of investigation 
(discovery) will attempt to find important correlations, say the influence of 
seasons and/or weather on sales of specific products at specific events. More 
examples, in particular of big data usage for high-level strategic business 
decisions, are given in Sect. 8.5 on future requirements. 
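
The three levels can be illustrated on a toy dataset. The Python sketch below is a minimal illustration under stated assumptions: the sales records, field names, and function names are all hypothetical, and a plain Pearson correlation stands in for whatever analysis a real investigation would use.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical sales records: one row per event and product.
SALES = [
    {"event": "E1", "product": "ice cream", "temperature_c": 10, "units": 20},
    {"event": "E2", "product": "ice cream", "temperature_c": 15, "units": 35},
    {"event": "E3", "product": "ice cream", "temperature_c": 20, "units": 50},
    {"event": "E4", "product": "ice cream", "temperature_c": 25, "units": 70},
    {"event": "E5", "product": "ice cream", "temperature_c": 30, "units": 90},
    {"event": "E1", "product": "umbrella", "temperature_c": 10, "units": 100},
    {"event": "E2", "product": "umbrella", "temperature_c": 15, "units": 80},
    {"event": "E3", "product": "umbrella", "temperature_c": 20, "units": 60},
    {"event": "E4", "product": "umbrella", "temperature_c": 25, "units": 30},
    {"event": "E5", "product": "umbrella", "temperature_c": 30, "units": 10},
]


def lookup(event, product):
    """Lookup: retrieve a known item, e.g. for verification."""
    return [r for r in SALES if r["event"] == event and r["product"] == product]


def units_by_product():
    """Learning: aggregate records so that products can be compared."""
    totals = defaultdict(int)
    for r in SALES:
        totals[r["product"]] += r["units"]
    return dict(totals)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def temperature_effect(product):
    """Investigation: how strongly do sales of a product follow temperature?"""
    rows = [r for r in SALES if r["product"] == product]
    return pearson([r["temperature_c"] for r in rows], [r["units"] for r in rows])


print(lookup("E3", "ice cream"))
print(units_by_product())
print(temperature_effect("ice cream"), temperature_effect("umbrella"))
```

Lookup returns stored facts unchanged, learning condenses them into a comparable view, and investigation produces a finding (here a correlation) that was not explicit in any single record.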


At an even higher level, these functionalities might be (partially) automated to 
provide predictive and even normative analyses. The latter refers to automatically 
derived and implemented decisions based on the results of automatic (or manual) 
analysis. However, such functions are beyond the scope of typical decision support 
systems and are more likely to be included in complex event processing (CEP) 
environments, where the low latency of automated decisions is weighted more heavily 
than the additional safety of a human in the loop that is provided by decision 
support systems. 


8.4.3 Predictive Analysis 


A prime example of predictive analysis is predictive maintenance based on big 
data usage. Maintenance intervals are typically determined as a balance between a 
costly, high frequency of maintenance and an equally costly danger of failure 
before maintenance. Depending on the application scenario, safety issues often 
mandate frequent maintenance, e.g., in the aerospace industry. However, in other 
cases the cost of machine failures is not catastrophic and determining maintenance 
intervals becomes a purely economic exercise. 

The assumption underlying predictive analysis is that given sufficient sensor 
information from a specific machine and a sufficiently large database of sensor and 
failure data from this machine or the general machine type, the specific time to 
failure of the machine can be predicted more accurately. This approach promises to 
lower costs due to: 


• Longer maintenance intervals as “unnecessary” interruptions of production 
  (or employment) can be avoided when the regular time for maintenance is 
  reached. A predictive model allows for an extension of the maintenance interval, 
  based on current sensor data. 

• Lower number of failures as the number of failures occurring earlier than 
  scheduled maintenance can be reduced based on sensor data and predictive 
  maintenance calling for earlier maintenance work. 

• Lower costs for failures as potential failures can be predicted by predictive 
  maintenance with a certain advance warning time, allowing for scheduling 
  maintenance/exchange work, lowering outage times. 
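
The three effects listed above can be illustrated by a simple decision rule. The following Python sketch is an illustration under stated assumptions, not a method described in this chapter: it supposes that a predictive model (not shown) already supplies an estimated time to failure, and the names plan_maintenance, safety_margin_hours, and max_extension_hours are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MaintenanceDecision:
    action: str                 # "advance", "extend", or "keep"
    hours_until_service: float  # operating hours until the planned service


def plan_maintenance(predicted_hours_to_failure, hours_to_scheduled_service,
                     safety_margin_hours, max_extension_hours):
    """Adjust a scheduled service date using a predicted time to failure."""
    # Latest point at which service is still considered safe.
    latest_safe = predicted_hours_to_failure - safety_margin_hours

    if latest_safe < hours_to_scheduled_service:
        # Failure is expected before the scheduled date: service earlier.
        return MaintenanceDecision("advance", max(latest_safe, 0.0))

    # The machine looks healthy: extend the interval, but only up to a cap.
    extended = min(latest_safe, hours_to_scheduled_service + max_extension_hours)
    if extended > hours_to_scheduled_service:
        return MaintenanceDecision("extend", extended)
    return MaintenanceDecision("keep", hours_to_scheduled_service)


# Healthy machine: the interval is extended up to the allowed cap.
print(plan_maintenance(1000, 200, safety_margin_hours=100, max_extension_hours=300))
# Degrading machine: service is brought forward with advance warning.
print(plan_maintenance(150, 200, safety_margin_hours=100, max_extension_hours=300))
```

The "extend" branch corresponds to longer maintenance intervals, while the "advance" branch covers both fewer failures and lower failure costs, since the work is scheduled before the predicted breakdown.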


8.4.3.1 New Business Model 


The application of predictive analytics requires the availability of sensor data for a 
specific machine (where “machine” is used as a fairly generic term) as well as a 
comprehensive dataset of sensor data combined with failure data. 

Equipping existing machinery with additional sensors, adding communication 
pathways from sensors to the predictive maintenance services, etc., can be a costly 
proposition. Having experienced reluctance from their customers to make such 
investments, a number of companies (mainly manufacturers of machines) have developed 
new business models addressing these issues. 

Prime examples are GE wind turbines and Rolls Royce airplane engines. Rolls 
Royce engines are increasingly offered for rent, with full-service contracts 
including maintenance, allowing the manufacturer to reap the benefits of applying 
predictive maintenance. By correlating the operational context with engine sensor 
data, failures can be predicted early, reducing (the costs of) replacements and 
allowing for planned maintenance rather than just scheduled maintenance. GE 
OnPoint solutions offer similar service packages that are sold in conjunction with 
GE engines.² 


8.4.4 Exploration 


Big datasets and the corresponding analytics results can be distributed 
across multiple sources and formats (e.g. news portals, travel blogs, social networks, 
web services, etc.). To answer complex questions, e.g. “Which astronauts have 
been on the moon?”, “Where is the nearest Italian restaurant with high ratings?”, 
or “Which sights should I visit in what order?”, users have to start multiple requests 
to multiple, heterogeneous sources and media. Finally, the results have to be 
combined manually. 

Support for the human trial-and-error approach can add value by providing 
intelligent methods for automatic information extraction and aggregation to answer 
complex questions. Such methods can transform the data analysis process to 
become explorative and iterative. In a first phase, relevant data is identified, and 
then, in a second learning phase, context is added for such data. A third exploration 
phase allows various operations for deriving decisions from the data or 
transforming and enriching the data. 

Given the new complexity of data and data analysis available for exploration, 
there are a number of emerging trends in explorative interfaces that are discussed in 
Sect. 8.5.2.4 on complex exploration. 


8.4.5 Iterative Analysis 


Efficient, parallel processing of iterative data streams brings a number of 
technical challenges. Iterative data analysis processes typically compute analysis 
results in a sequence of steps. In every step, a new intermediate result or state is 
computed and updated. Given the high volumes in big data applications, 
computations are executed in parallel, distributing, storing, and managing the state 
efficiently across multiple machines. Many algorithms need a high number of 
iterations to compute the final results, requiring low latency iterations to minimize 
overall response times. However, in some applications, the computational effort is 
reduced significantly between the first and the last iterations. Batch-based systems 
such as Map/Reduce (Dean and Ghemawat 2008) and Spark (Apache 2014) repeat 
all computations in every iteration even when the (partial) results do not change. 
Truly iterative dataflow systems like Stratosphere (Stratosphere 2014) or 
specialized graph systems like GraphLab (Low et al. 2012) and Google Pregel (Malewicz 
et al. 2010) exploit such properties and reduce the computational cost in every 
iteration. 

2 See http://www.aviationpros.com/press_release/11239012/tui-orders-additional-genx-powered-boeing-787s 

Future requirements on technologies and their applications in big data usage are 
described in Sect. 8.5.1.3, covering aspects of pipelines versus materialization and 
error tolerance. 
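
The contrast drawn in this section between repeating all computations and exploiting unchanged partial results can be made concrete with a small, single-machine Python sketch. It is only an illustration: the example computes connected components by propagating the smallest vertex label, the graph and all function names are assumptions made here, and it does not reproduce the actual mechanisms of any of the systems named above.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (a, b) edges."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def smallest_neighbourhood_label(labels, adj, v):
    return min([labels[v]] + [labels[n] for n in adj[v]])


def bulk_iteration(adj):
    """Re-evaluate every vertex in every pass until nothing changes."""
    labels = {v: v for v in adj}
    evaluations = 0
    changed = True
    while changed:
        changed = False
        new_labels = {}
        for v in adj:
            evaluations += 1
            new_labels[v] = smallest_neighbourhood_label(labels, adj, v)
            if new_labels[v] != labels[v]:
                changed = True
        labels = new_labels
    return labels, evaluations


def workset_iteration(adj):
    """Re-evaluate only vertices with a neighbour that changed in the last pass."""
    labels = {v: v for v in adj}
    evaluations = 0
    workset = set(adj)
    while workset:
        updates, next_workset = {}, set()
        for v in workset:
            evaluations += 1
            best = smallest_neighbourhood_label(labels, adj, v)
            if best != labels[v]:
                updates[v] = best
                next_workset.update(adj[v])  # neighbours must be revisited
        labels.update(updates)
        workset = next_workset
    return labels, evaluations


# Hypothetical graph: a chain of six vertices and a separate pair.
graph = build_adjacency([(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (7, 8)])
bulk_labels, bulk_cost = bulk_iteration(graph)
work_labels, work_cost = workset_iteration(graph)
print(bulk_labels == work_labels, bulk_cost, work_cost)
```

Both variants arrive at the same component labels, but the workset variant evaluates fewer vertices because the part of the graph that has already converged is no longer touched.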


8.4.6 Visualization 


Visualizing the results of an analysis including a presentation of trends and other 
predictions by adequate visualization tools is an important aspect of big data usage. 
The selection of relevant parameters, subsets, and features is a crucial element of 
data mining and machine learning with many cycles needed for testing various 
settings. As the settings are evaluated on the basis of the presented analysis results, 
a high-quality visualization allows for a fast and precise evaluation of the quality of 
results, e.g., in validating the predictive quality of a model by comparing the results 
against a test dataset. Without supportive visualization, this can be a costly and slow 
process, making visualization an important factor in data analysis. 

For using the results of data analytics in later steps of a data usage scenario, for 
example, allowing data scientists and business decision-makers to draw conclu- 
sions from the analysis, a well-selected visual presentation can be crucial for 
making large result sets manageable and effective. Depending on the complexity 
of the visualizations, they can be computationally costly and hinder interactive 
usage of the visualization. 

However, explorative search in analytics results is essential for many cases of 
big data usage. In some cases, the results of a big data analysis will be applied only 
to a single instance, say an airplane engine. In many cases, though, the analysis 
dataset will be as complex as the underlying data, reaching the limits of classical 
statistical visualization techniques and requiring interactive exploration and ana- 
lysis (Spence 2006; Ward et al. 2010). In Shneiderman’s seminal work on visual- 
ization (Shneiderman 1996), he identifies seven types of tasks: overview, zoom, 
filter, details-on-demand, relate, history, and extract. 

Yet another area of visualization applies to data models that are used in many 
machine-learning algorithms and differ from traditional data mining and reporting 
applications. Where such data models are used for classification, clustering, recom- 
mendations, and predictions, their quality is tested with well-understood datasets. 
Visualization supports such validation and the configuration of the models and their 
parameters. 

Finally, the sheer size of datasets is a continuous challenge for visualization tools 
that is driven by technological advances in GPUs, displays, and the slow adoption 
of immersive visualization environments such as caves, VR, and AR. These aspects 
are covered in the fields of scientific and information visualization. 

The following section elaborates the application of visualization for big data 
usage, known as visual analytics. Section 8.5.1.4 presents a number of research 
challenges related to visualization in general. 


8.4.6.1 Visual Analytics 


A definition of visual analytics, taken from Keim et al. (2010), recalls first mentions 
of the term in 2004. More recently, the term is used in a wider context, describing a 
new multidisciplinary field that combines various research areas including 
visualization, human-computer interaction, data analysis, data management, 
geo-spatial and temporal data processing, spatial decision support, and statistics. 

The “Vs” of big data affect visual analytics in a number of ways. The volume of 
big data creates the need to visualize high dimensional data and their analyses and 
to display multiple data types such as linked graphs. In many cases interactive 
visualization and analysis environments are needed that include dynamically linked 
visualizations. Data velocity and the dynamic nature of big data call for 
correspondingly dynamic visualizations that are updated much more often than 
previous, static reporting tools. Data variety presents new challenges for cockpits 
and dashboards. 

The main new aspects and trends are: 


• Interactivity, visual queries, (visual) exploration, multi-modal interaction 
  (touchscreen, input devices, AR/VR) 
• Animations 
• User adaptivity (personalization) 
• Semi-automation and alerting, CEP (complex event processing), and BRE 
  (business rule engines) 
• Large variety in data types, including graphs, animations, microcharts (Tufte), 
  gauges (cockpit-like) 
• Spatiotemporal datasets and big data applications addressing geographic 
  information systems (GIS) 
• Near real-time visualization. Sectors: finance industry (trading), manufacturing 
  (dashboards), oil/gas; CEP, BAM (business activity monitoring) 
• Data granularity varies widely 
• Semantics 


Use cases for visual analytics include multiple sectors, e.g. marketing, 
manufacturing, healthcare, media, energy, transportation (see also the use cases 
in Sect. 8.6), but also additional market segments such as software engineering. 

A special case of visual analytics that is spearheaded by the US intelligence 
community is visualization for cyber security. Due to the nature of this market 
segment, details can be difficult to obtain; however, there are publications available, 
e.g. from the VizSec conferences.³ 


8.5 Future Requirements and Emerging Trends for Big Data Usage 


This section provides an overview of future requirements and emerging trends that 
resulted from the task force’s research. 


8.5.1 Future Requirements for Big Data Usage 


As big data usage becomes more important, questions about its underlying 
assumptions also become more important. The key issue is the necessary validation of 
the underlying data. A quote attributed to Ronald Coase, winner of 
the Nobel Prize in economics in 1991, puts it as a joke alluding to the Inquisition: “If 
you torture the data long enough, it [they] will confess to anything”. 


On a more serious note there are some common misconceptions in big data 


usage: 


1. 


2. 


Ignoring modelling and instead relying on correlation rather than an understand- 
ing of causation. 

The assumption that with enough—or even all (see next point)—data available, 
no models are needed (Anderson 2008). 


. Sample bias. Implicit in big data is the expectation that all data will (eventually) 


be sampled. This is rarely ever true; data acquisition depends on technical, 
economical, and social influences that create sample bias. 


. Overestimation of accuracy of analysis: it is easy to ignore false positives. 


To address these issues, the following future requirements will gain importance: 


1. Include more modelling, resort to simulations, and correct (see next point) for
   sample bias.

2. Understand the data sources and the sample bias that is introduced by the context
   of data acquisition. Create a model of the real, total dataset to correct for
   sample bias.

3. Data and analysis transparency: If the data and the applied analyses are known, it
   is possible to judge the (statistical) chances that correlations are not only
   “statistically significant” but also genuine, i.e. that the number of tested,
   possible correlations is not so large that finding some correlation was almost
   inevitable.


³ http://www.vizsec.org


With these general caveats as background, the key areas that are expected to
govern the future of big data usage have been identified:


• Data quality in big data usage

• Tool performance

• Strategic business decisions

• Human resources, big data specific positions


The last point is exemplified by a report on the UK job market in big data
(e-skills 2013), which shows demand growing strongly. In particular, the increasing
number of administrators sought shows that big data is growing from experimental
status to a core business unit.


8.5.1.1 Specific Requirements 


Some general trends are already identifiable and can be grouped into the following 
requirements: 


• Use of big data for marketing purposes

• Detect abnormal events of incoming data in real time

• Use of big data to improve efficiency (and effectiveness) in core operations

  – Realizing savings during operations through real-time data availability, more
    fine-grained data, and automated processing

  – Better data basis for planning of operational details and new business
    processes

  – Transparency for internal and external (customers) purposes

• Customization, situation adaptivity, context-awareness, and personalization

• Integration with additional datasets

  – Open data

  – Data obtained through sharing and data marketplaces

• Data quality issues where data is not curated or provided under pressure, e.g., to
  acquire an account in a social network where the intended usage is anonymous

• Privacy and confidentiality issues, data access control

• Interfaces

  – Interactive and flexible, ad hoc analyses to provide situation-adaptive and
    context-aware reactions, e.g. recommendations

  – Suitable interfaces to provide access to big data usage in non-office
    environments, e.g. mobile situations, factory floors, etc.

  – Tools for visualization, query building, etc.

• Discrepancy between the technical know-how necessary to execute data analysis
  (technical staff) and usage in business decisions (by non-technical staff)

• Need for tools that enable early adoption. As the developments in industry are
  perceived to be accelerating, the head start from early adoption is also perceived
  as being of growing importance and a growing competitive advantage.


8.5.1.2 Industry 4.0 


For applications of big data in areas such as manufacturing, energy, transportation,
and even health, wherever intelligent machines are involved in the business
process, hardware technology (i.e. machines and sensors) needs to be aligned
with software technology (i.e. the data representation, communication, storage,
analysis, and control of the machinery). As embedded systems evolve into
“cyber-physical systems”, the development of hardware (computing, sensing, and
networking) and software (data formats, operating systems, and analysis and
control systems) will need to be synchronized.

Industrial suppliers are beginning to address these issues. GE Software observes:
“However well-developed industrial technology may be, these short-term and
long-term imperatives cannot be realized using today’s technology alone. The software
and hardware in today’s industrial machines are very interdependent and closely
coupled, making it hard to upgrade software without upgrading hardware, and vice
versa” (Chauhan 2013).

On the one hand this adds a new dependency to big data usage, namely the 
dependency on hardware systems and their development and restrictions. On the 
other hand, it opens new opportunities to address more integrated systems with big 
data usage applications at the core of supporting business decisions. 


8.5.1.3 Iterative Data Streams 


There are two prominent areas of requirements for efficient and robust 
implementations of big data usage that relate to the underlying architectures and 
technologies in distributed, low-latency processing of large datasets and large data 
streams. 


• Pipelining and materialization: High data rates pose a special challenge for
  data stream processing. The underlying architectures are based on a pipeline
  approach where processed data can be handed to the next processing step with
  very low delay to avoid pipeline congestion. Where no algorithm suited to
  pipelining exists, data is collected and stored before being processed. Such
  approaches are called “materialization”. Low latency for queries can typically
  only be realized in pipelining approaches.

• Error tolerance: Fault tolerance and error minimization are important
  challenges for pipelining systems. Failures in compute nodes are common and
  can cause parts of the analysis result to be lost. Parallel systems must be designed
  in a robust way to overcome such faults without failing. A common approach is
  continuous checkpointing, in which intermediate results are saved, allowing the
  reconstruction of a previous state in case of an error. Saving data at checkpoints
  is easy to implement, yet results in high execution costs due to the
  synchronization needs and storage costs when saving to persistent storage. New
  alternative algorithms use optimistic approaches that can recreate valid states,
  allowing computation to continue. Such approaches add costs only in cases of
  errors but are applicable only in restricted cases.
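
The two requirements above can be sketched in a few lines. This is a minimal,
single-process illustration under stated assumptions, not a distributed
implementation: `readings` is a hypothetical replayable data source, a plain
dictionary stands in for persistent storage, and the node failure is simulated.
`materialized_total` stores all data before processing it, while `pipelined_total`
handles each item as it arrives and saves a checkpoint at regular intervals so that
a restarted run can resume from the last saved state.

```python
from typing import Dict, Iterator, Optional


def readings(n: int) -> Iterator[int]:
    """Hypothetical replayable source that yields one reading at a time."""
    for i in range(n):
        yield i % 7


def materialized_total(n: int) -> int:
    """Materialization: collect and store all data, then process it."""
    stored = list(readings(n))
    return sum(stored)


def pipelined_total(n: int, store: Dict[str, int], every: int = 10,
                    fail_at: Optional[int] = None) -> int:
    """Pipelining with checkpoints: process each item on arrival and save
    (offset, total) every `every` items. `store` stands in for persistent
    storage; `fail_at` simulates a node failure at that offset."""
    start = store.get("offset", 0)
    total = store.get("total", 0)
    for offset, value in enumerate(readings(n)):
        if offset < start:
            continue  # already covered by the last checkpoint
        if offset == fail_at:
            raise RuntimeError("simulated node failure")
        total += value
        if (offset + 1) % every == 0:
            store["offset"], store["total"] = offset + 1, total
    return total


store: Dict[str, int] = {}
try:
    pipelined_total(100, store, fail_at=57)  # fails after the checkpoint at 50
except RuntimeError:
    pass
recovered = pipelined_total(100, store)      # resumes from offset 50
print(recovered == materialized_total(100))
```

The work done between the last checkpoint and the failure (offsets 50 to 56 here) is
repeated after recovery, and every checkpoint write adds cost even when no failure
occurs, which is the trade-off described above.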


8.5.1.4 Visualization 


A number of future trends in the area of visualization and visual analytics need
to be addressed in the medium to far future, for example (Keim et al. 2010):


• Visual perception and cognitive aspects

• “Design” (visual arts)

• Data quality, missing data, data provenance

• Multi-party collaboration, e.g., in emergency scenarios

• Mass-market, end user visual analytics


In addition, Markl et al. (2013) compiled a long list of research questions from 
which the following are of particular importance to data usage and visualization: 


• How can visualization support the process of constructing data models for
  prediction and classification?

• Which visualization technologies can support an analyst in explorative analysis?

• How can audio and video (animations) be automatically collected and generated
  for visual analytics?

• How can meta-information such as semantics, data quality, and provenance be
  included into the visualization process?


8.5.2 Emerging Paradigms for Big Data Usage 


A number of emerging paradigms for big data usage have been identified that fall
into two categories. The first category encompasses all aspects of integrating big
data usage into larger business processes and the evolution towards a new trend
called “smart data”. The second category is much more local and concerns the
interface tools for working with big data. New exploration tools will allow data
scientists and analysts in general to access more data more quickly and will
support decision-making by finding trends and correlations in the dataset that can
be grounded in models of the underlying business processes.


A number of emerging technology trends (e.g. in-memory databases) allow for
analysis that is fast enough to enable explorative data analysis and decision
support. At the same time, new services are developing that provide data
analytics, integration, and transformation of big data into organizational
knowledge.

As in all new digital markets, the development is driven in part by start-ups that 
fill new technology niches; however, the dominance of big players is particularly 
important as they have much easier access to big data. The transfer of technology to 
SMEs is faster than in previous digital revolutions; however, appropriate business 
cases for SMEs are not easy to design in isolation and typically involve the 
integration into larger networks or markets. 


8.5.2.1 Smart Data 


The concept of smart data is defined as the effective application of big data that is
successful in bringing measurable benefits and has a clear meaning (semantics),
measurable data quality, and security (including data privacy standards).⁴

Smart data scenarios are thus a natural extension of big data usage in any 
economically viable context. These can be new business models that are made 
possible by innovative applications of data analysis, or improving the efficiency/ 
profitability of existing business models. The latter are easy to start with as data is 
available and, as it is embedded in existing business processes, already has an 
assigned meaning (semantics) and business structure. Thus, it is the added value of 
guaranteed data quality and existing metadata that can make big data usage become 
a case of smart data. 

Beyond the technical challenges, the advent of smart data brings additional 
challenges: 


1. Solving regulatory issues regarding data ownership and data privacy 
(Bitkom 2012). 

2. Making data more accessible by structuring through the addition of metadata, 
allowing for the integration of separate data silos (Bertolucci 2013). 

3. Leveraging the benefits of already available open data and linked data sources.
Their market potential is currently not fully realized (Groves et al. 2013).


The main potential of data usage, according to Lo (2012), is found in the
optimization of business processes, improved risk management, and
market-oriented product development. The purpose of enhanced big data usage as
smart data is to solve social and economic challenges in many sectors, including
energy, manufacturing, health, and media.


⁴ This section reflects the introduction of smart data as stated in a broadly supported memorandum,
available at http://smart-data.fzi.de/memorandum/

For SMEs, the focus is on integration into larger value chains in which multiple
companies collaborate, giving SMEs access to the effects of scale that underlie the
promise of big data usage. Smart data enables such collaborations when the
meaning of data is explicit, allowing planning, control, production, and state
information data to be combined beyond the limits of each partnering company.

Smart data creates requirements in four areas: semantics, data quality, data 
security and privacy, and metadata. 


Semantics Understanding and having available the meaning of datasets enables 
important steps in smart data processing: 


• Interoperability

• Intelligent processing

• Data integration

• Adaptive data analysis


Metadata A means to encode and store the meaning (semantics) of data.
Metadata can also be used to store further information about data quality,
provenance, usage rights, etc. Currently there are many proposals but no
established standards for metadata.
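
As an illustration of how explicit metadata supports interoperability and data
integration, the following sketch attaches a small metadata record to each dataset
and checks whether two datasets agree on the meaning and unit of a column before
they are combined. Since no metadata standard is established, the schema
(`DatasetMeta`, `ColumnMeta`) and all field names and example values are
hypothetical choices made for this example only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ColumnMeta:
    name: str
    meaning: str   # semantics: what a value in this column denotes
    unit: str


@dataclass
class DatasetMeta:
    title: str
    provenance: str        # where the data originates
    usage_rights: str      # e.g. a licence identifier
    quality: Dict[str, float] = field(default_factory=dict)
    columns: List[ColumnMeta] = field(default_factory=list)

    def column(self, name: str) -> ColumnMeta:
        for col in self.columns:
            if col.name == name:
                return col
        raise KeyError(name)


def same_semantics(a: DatasetMeta, b: DatasetMeta, name: str) -> bool:
    """True if both datasets agree on the meaning and unit of a column."""
    ca, cb = a.column(name), b.column(name)
    return (ca.meaning, ca.unit) == (cb.meaning, cb.unit)


plant_a = DatasetMeta(
    title="Plant A sensor log", provenance="line 3 controller",
    usage_rights="internal", quality={"completeness": 0.98},
    columns=[ColumnMeta("temp", "bearing temperature", "celsius")],
)
plant_b = DatasetMeta(
    title="Plant B sensor log", provenance="supplier export",
    usage_rights="shared under contract", quality={"completeness": 0.91},
    columns=[ColumnMeta("temp", "bearing temperature", "fahrenheit")],
)

# The column names match, but the metadata reveals a unit mismatch that must
# be resolved before the two datasets can be integrated.
print(same_semantics(plant_a, plant_b, "temp"))
```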


Data Quality The quality and provenance of data is one of the well-understood 
requirements for big data (related to one of the “Vs”, i.e. “veracity”). 


Data Security and Privacy These separate, yet related, issues are particularly
influenced by existing regulatory standards. Violations of data privacy laws can
easily arise from the processing of personal data, e.g. movement profiles, health
data, etc. Although such data can be enormously beneficial, violations of data
privacy laws carry severe punishments. Rather than doing away with such
regulations, methods for anonymization (ICO 2012) and pseudonymization (Gowing
and Nickson 2010) can be developed and used to address these issues.
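
One widely used pseudonymization technique is to replace direct identifiers with a
keyed hash, so that records belonging to the same person can still be linked while
the identifier itself is no longer stored. The sketch below uses HMAC-SHA256 from
the Python standard library; it is an illustrative assumption and not the specific
method described in the cited reports. The key, the record layout, and the example
values are hypothetical. Note that pseudonymized data is not anonymous: anyone
holding the key can recompute the mapping, and the remaining attributes may still
allow re-identification.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256, hex-encoded)."""
    return hmac.new(secret_key, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()


# Hypothetical key; in practice it would be stored separately from the data.
key = b"example-key-kept-separately"

records = [
    {"patient": "alice@example.org", "steps": 8200},
    {"patient": "bob@example.org", "steps": 4100},
    {"patient": "alice@example.org", "steps": 9100},
]

pseudonymized = [
    {"patient": pseudonymize(r["patient"], key), "steps": r["steps"]}
    for r in records
]

# Records of the same person share a pseudonym, so analyses per person
# remain possible without storing the original identifier.
print(pseudonymized[0]["patient"] == pseudonymized[2]["patient"])
```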


8.5.2.2 Big Data Usage in an Integrated and Service-Based 
Environment 


The continuing integration of digital services (Internet of Services), smart digital
products (Internet of Things), and production environments (Internet of Things,
Industry 4.0) includes the usage of big data in most integration steps. A recent
study by General Electric examined the various dimensions of integration within
the airline industry (Evans and Annunziata 2012). Smart products such as a turbine
are integrated into larger machines; in this example, an airplane. Planes
are in turn part of whole fleets that operate in a complex network of airports,
maintenance hangars, etc. At each step, the current integration of the business
processes is extended by big data integration. The benefits for optimization can
be harvested at each level (assets, facility, fleets, and the entire network) and by
integrating knowledge from data across all steps.


8.5.2.3 Service Integration 


The infrastructure within which big data usage will be applied will adapt to this
integration tendency. Hardware and software will be offered as services, all
integrated to support big data usage. See Fig. 8.3 for a concrete picture of the
stack of services that will provide this environment. The GE study foresees that
“Beyond technical standards and protocols, new platforms that enable firms to
build specific applications upon a shared framework/architecture [are necessary]”,
and McKinsey’s study notes that “There is also a need for on-going innovation in
technologies and techniques that will help individuals and organisations to
integrate, analyse, visualise, and consume the growing torrent of big data”
(Manyika et al. 2011).

Figure 8.3 shows big data as part of a virtualized service infrastructure. At the 
bottom level, current hardware infrastructure will be virtualized with cloud com- 
puting technologies; hardware infrastructure as well as platforms will be provided 
as services. On top of this cloud-based infrastructure, software as a service (SaaS) 
and on top of this business processes as a service (BPaaS) can be built. In parallel, 
big data will be offered as a service and embedded as the precondition for knowl- 
edge services, e.g. the integration of semantic technologies for analysis of unstruc- 
tured and aggregated data. Note that big data as a service may be seen as extending 
a layer between PaaS and SaaS. 


[Figure 8.3: stack of service layers, from bottom to top: PaaS (Platform as a
Service); SaaS (Software as a Service) alongside BDaaS (Big Data as a Service);
BPaaS (Business Process as a Service) alongside KaaS (Knowledge as a Service).
Associated roles: System Administrator, Application Developer, Clerk, Knowledge
Worker, Business Designer. Virtualisation chain: Hardware, Software, Information &
Knowledge (Big Data).]


Fig. 8.3 Big data in the context of an extended service infrastructure. W. Wahlster (2013,
Personal Communication)

This virtualization chain from hardware to software to information and
knowledge also identifies the skills needed to maintain the infrastructure.
Knowledge workers or data scientists are needed to run big data and knowledge
services.


8.5.2.4 Complex Exploration 


Big data exploration tools support complex datasets and their analysis through a
multitude of new approaches (see e.g. Sect. 8.5.1.4 on visualization). Current
methods for exploring data and analysis results have a central shortcoming: a user
can follow their exploration only selectively in one direction. If they enter a dead
end or otherwise unsatisfactory state, they have to backtrack to a previous state,
much as in depth-first search or hill-climbing algorithms. Emerging user interfaces
for parallel exploration (CITE) are more versatile and can be compared to best-first
or beam searches: the user can follow and compare multiple sequences of
exploration at the same time.

Early instances of this approach have been developed under the name “subjunc- 
tive interfaces” (Lunzer and Hornbæk 2008) and applied to geographical datasets 
(Javed et al. 2012) and as “parallel faceted browsing” (Buschbeck et al. 2013). The 
latter approach assumes structured data but is applicable to all kinds of datasets, 
including analysis results and CEP (complex event processing). 

These complex exploration tools address an inherent danger in big data analysis 
that arises when large datasets are automatically searched for correlations: an 
increasing number of seemingly statistically significant correlations will be found 
and need to be tested for underlying causations in a model or by expert human 
analysis. Complex exploration can support the checking process by allowing a 
parallel exploration of variations of a pattern and expected consequences of 
assumed causation. 
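
The danger described above can be reproduced with purely random data. The
following sketch, under stated assumptions, tests 1000 noise features against a
noise target: although no real relationship exists, roughly 5% of the features
appear “statistically significant” at the usual 0.05 level. The p-values are
approximated with the Fisher z-transformation, and the Bonferroni correction
(dividing the significance level by the number of tests) is shown as one standard,
conservative remedy; neither is prescribed by the sources cited in this chapter.

```python
import math
import random
from statistics import NormalDist

random.seed(7)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def p_value(r, n):
    """Approximate two-sided p-value via the Fisher z-transformation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


n_samples, n_features, alpha = 100, 1000, 0.05
target = [random.gauss(0, 1) for _ in range(n_samples)]

p_values = []
for _ in range(n_features):
    feature = [random.gauss(0, 1) for _ in range(n_samples)]  # pure noise
    p_values.append(p_value(pearson(feature, target), n_samples))

naive_hits = sum(p < alpha for p in p_values)
bonferroni_hits = sum(p < alpha / n_features for p in p_values)

print(f"'significant' at {alpha}: {naive_hits} of {n_features}")
print(f"after Bonferroni correction: {bonferroni_hits}")
```

A correction of this kind reduces chance findings but does not establish causation;
surviving correlations still need to be checked against a model or by expert
analysis, as described above.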


8.6 Sectors Case Studies for Big Data Usage 


In this section an overview of case studies that demonstrate the actual and potential 
value of big data usage is presented. More details can be found in Zillner 
et al. (2013, 2014). The use cases selected here exemplify particular aspects that 
are covered in those reports. 


8.6.1 Healthcare: Clinical Decision Support 


Description Clinical decision support (CDS) applications aim to enhance the
efficiency and quality of care operations by assisting clinicians and healthcare
professionals in their decision-making process. CDS applications enable
context-dependent information access by providing pre-diagnosis information, or by
validating and correcting data. Thus, CDS systems support clinicians in informed
decision-making, which in turn helps to reduce treatment errors and to improve
efficiency.

By relying on big data technology, future clinical decision support applications
will become substantially more intelligent. An example use case is the
pre-diagnosis of medical images, with treatment recommendations reflecting
existing medical guidelines.

The core prerequisite is the comprehensive data integration and the very high 
level of data quality necessary for physicians to actually rely on automated decision 
support. 


8.6.2 Public Sector: Monitoring and Supervision of Online 
Gambling Operators 


Description This future scenario represents a clear need. The main goal is fraud
detection, which is hard to carry out because the amount of data received in real
time, on a daily and monthly basis, cannot be processed with standard database
tools. Real-time data is received from gambling operators every five minutes.
Currently, supervisors have to define the cases on which to apply offline analysis
of selected data.

The core prerequisite is the ability to explore data interactively and to compare
different models and parameter settings, based on technology such as complex event
processing that allows the real-time analysis of such a dataset. This use case
relates to the issues of visual analytics and exploration, and predictive analytics.


8.6.3 Telco, Media, and Entertainment: Dynamic Bandwidth 
Increase 


Description The introduction of new Telco offerings (e.g. a new gaming
application) can cause problems with bandwidth allocations. Such scenarios are of
special importance to telecommunication providers, as more profit is made with
data services than with voice services. In order to pinpoint the cause of bandwidth
problems, transcripts of call-centre conversations can be mined to identify the
customers and games involved, together with timing information. Infrastructure
measures can then be put into place to dynamically change the provided bandwidth
according to usage.

The core prerequisites are related to predictive analysis. If problems can be
detected while they are building up, peaks can be avoided altogether. Where the
decision support can be automated, this scenario can be extended to prescriptive
analysis.


8.6.4 Manufacturing: Predictive Analysis


Description Where sensor data as well as contextual and environmental data are
available, possible failures of machinery can be predicted. The predictions are
based on abnormal sensor values that correspond to functional models of failure.
Furthermore, context information such as inferences on heavy or light usage
depending on the tasks executed (taken, e.g. from an ERP system) and contributing
information such as weather conditions, etc., can be taken into account.

The core prerequisites, besides classical requirements such as data integration 
from the various, partially unstructured, data sources, are transparent prediction 
models and sufficiently large datasets to enable the underlying machine-learning 
algorithms. 
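
The first step of such a prediction, detecting abnormal sensor values, can be
sketched as follows. This is a deliberately simple statistical baseline (a rolling
z-score over the most recent readings), not a functional model of failure and not a
machine-learning method; the function name, window size, threshold, and the
synthetic vibration readings are all assumptions made for illustration.

```python
from collections import deque
from statistics import mean, stdev
from typing import Iterable, Iterator, Tuple


def abnormal_readings(values: Iterable[float], window: int = 20,
                      threshold: float = 4.0) -> Iterator[Tuple[int, float]]:
    """Yield (index, value) for readings far outside the recent window."""
    recent = deque(maxlen=window)
    for i, v in enumerate(values):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(v - mu) > threshold * sigma:
                yield i, v
                continue  # keep the outlier out of the baseline window
        recent.append(v)


# Synthetic vibration readings around 1.0 with one injected spike at index 60.
readings = [1.0 + 0.05 * ((i * 7) % 5 - 2) for i in range(100)]
readings[60] = 3.0

print(list(abnormal_readings(readings)))
```

In a real setting, the flagged readings would be combined with context information
(usage, weather, and so on) and matched against failure models before a prediction
is made.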


8.7 Conclusions 


This chapter has presented the state of the art as well as future requirements and
emerging trends of big data usage.

The major uses of big data applications are in decision support, in predictive 
analytics (e.g. for predictive maintenance), and in simulation and modelling. New 
trends are emerging in visualization (visual analytics) and new means of explor- 
ation and comparison of alternate and competing analyses. 

A special area of use cases for big data is the manufacturing, transportation, and 
logistics sector with a new trend “Industry 4.0”. The emergence of cyber-physical 
systems for production, transportation, logistics, and other sectors brings new 
challenges for simulation and planning, for monitoring, control, and interaction 
(by experts and non-experts) with machinery or big data usage applications. On a 
larger scale, new services and a new service infrastructure are required. Under the 
title “smart data” and smart data services, requirements for data and also service 
markets are formulated. Besides the technology infrastructure for the interaction 
and collaboration of services from multiple sources, there are legal and regulatory 
issues that need to be addressed. A suitable service infrastructure is also an 
opportunity for SMEs to take part in big data usage scenarios by offering specific 
services, e.g., through data service marketplaces. 


Open Access This chapter is distributed under the terms of the Creative Commons Attribution- 
Noncommercial 2.5 License (http://creativecommons.org/licenses/by-nc/2.5/) which permits any 
noncommercial use, distribution, and reproduction in any medium, provided the original author(s) 
and source are credited. 

The images or other third party material in this book are included in the work’s Creative 
Commons license, unless indicated otherwise in the credit line; if such material is not included in 
the work’s Creative Commons license and the respective action is not permitted by statutory 
regulation, users will need to obtain permission from the license holder to duplicate, adapt, or


Part II
Usage and Exploitation of Big Data 


Chapter 9 
Big Data-Driven Innovation in Industrial 
Sectors 


Sonja Zillner, Tilman Becker, Ricard Munné, Kazim Hussain, 
Sebnem Rusitschka, Helen Lippell, Edward Curry, and Adegboyega Ojo 


9.1 Introduction 


Regardless of what form it takes, data has the potential to tell stories, identify cost
savings and efficiencies, reveal new connections and opportunities, and enable
improved understanding of the past to shape a better future (US Chamber of
Commerce Foundation 2014). Big data connotes the enormous volume of information
including user-generated data from social media platforms (i.e. Internet data);
machine, mobile, and GPS data as well as the Internet of Things (industrial and
sensor data); business data including customer, inventory, and transactional data
(enterprise data); and datasets generated or collected by government agencies, as
well as universities and non-profit organizations (public data) (US Chamber of
Commerce Foundation 2014). For many businesses and governments in different parts
of the world, techniques for processing and analysing these large volumes of data
(big data) constitute an important resource for driving value creation, fostering
new products, processes, and markets, as well as enabling the creation of new
knowledge (OECD 2014). In 2013 alone, the data-driven economy added an estimated
$67 billion in new value to the Australian economy, equivalent to 4.4% of its gross
domestic product or the whole of its retail sector (Stone and Wang 2014).

As a source of economic growth and development, big data constitutes an
infrastructural resource that could be used in several ways to produce different
products and services. It also enables the creation of knowledge that is vital for
controlling natural phenomena, social systems, or organizational processes and
supports complex decision-making (OECD 2014). In this vein, the international
development community and the United Nations are seeking political support at the
highest levels for harnessing data-driven innovations to support sustainable
development, particularly under the new global Sustainable Development Goals
(SDGs) (Independent Expert Advisory Group on Data Revolution 2014). Similarly,
cities like Helsinki, Manchester, Amsterdam, Barcelona, and Chicago are leveraging
big and open data from open sensor networks, public sector processes, and
crowdsourced social data to improve mobility, foster co-creation of digital public
services, and in general enable better city intelligence to support more effective
city planning and development (Ojo et al. 2015).

At the same time, there is a growing understanding of the challenges associated
with the exploitation of big data in society. These challenges range from a paucity
of requisite capacity (e.g. data literacy) to ethical dilemmas in handling big data
and the question of how to incentivize the participation of other critical
stakeholders in adopting and leveraging big data-driven innovation to tackle
societal challenges (Hemerly 2013; Insight Centre for Data Analytics 2015).

This chapter describes what is involved in big data-driven innovation, provides
examples of big data-driven innovations across different sectors, and synthesizes
enabling factors and challenges associated with the development of a big data
innovation ecosystem. The chapter closes by offering practical (policy)
recommendations on how to develop viable big data innovation programs and
initiatives.


9.2 Big Data-Driven Innovation


Innovation is an iterative process aimed at the creation of new products, processes,
knowledge, or services by the use of new or even existing knowledge (Kusiak
2009). Data-driven innovation entails exploitation of any kind of data in the
innovation process to create value (Stone and Wang 2014). The emerging trend
of big data-driven innovation is leading to the development of data-driven goods
and services and can enable data-driven planning, data-driven marketing, and
data-driven operations across all industrial sectors and domains. From the economic
perspective, data is a non-rivalrous good or commons (unlike a rivalrous resource
such as oil) that serves as an infrastructural resource (from a functional
perspective) and can be exploited simultaneously by many users or actors for
different competing or complementary ends. According to the OECD (2014), the demand
for data in this sense is driven primarily by downstream productive activities that
require data as an input and, in fact, as a non-trivial form of capital. In
addition, the same source asserts that data resources may be used as input into a
wide variety of goods, including private, public, and social goods. In other words,
big data potentially offers significant returns to scale and scope.

Big data-driven innovations are implicitly associated with a value chain model
or, more precisely, a “virtual value chain” specifying how the data of interest will
be gathered, organized, selected, transformed into products or services, and
distributed (Rayport and Sviokla 1995; Piccoli 2012). Big data value chains as
discussed in Chap. 3 are at the core of delivering data-driven innovation using big
data technology. At the organizational level, at least two categories of strategic
initiatives could result from big data-driven innovation and its underlying big data
value chain. The first category of initiatives aims to make information available on
aspects of organizational processes and services to enable improvements. In
general, by instrumenting organizational operations, large amounts of data (i.e. big
data) are generated that inform or drive required changes (Piccoli 2012). The second
category of initiatives is external facing and involves exploitation of customer
data such as search and user logs, transaction records, and other customer-generated
content to drive long-tail marketing, targeted and personalized recommendation,
increased sales, and customer satisfaction. A popular example of this is Netflix’s
collaborative filtering algorithm to predict user movie ratings (Chen and Storey
2012). Yet another example is Google’s use of users’ search behaviour to target
advertising (US Chamber of Commerce Foundation 2014).
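
To give an impression of what collaborative filtering involves, the sketch below
predicts a missing rating as the similarity-weighted average of the ratings given
by other users. It is a textbook-style, user-based variant with cosine similarity
on toy data, offered as an assumption-laden illustration; it is not a description
of the algorithm used by Netflix, and all user names, items, and ratings are
invented.

```python
import math
from typing import Dict, Optional

Ratings = Dict[str, Dict[str, float]]

# Invented toy data: user -> {item: rating on a 1-5 scale}
ratings: Ratings = {
    "ann":  {"A": 5, "B": 4, "C": 1},
    "ben":  {"A": 4, "B": 5, "C": 2, "D": 5},
    "cara": {"A": 1, "B": 2, "C": 5, "D": 1},
}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(data: Ratings, user: str, item: str) -> Optional[float]:
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, theirs in data.items():
        if other == user or item not in theirs:
            continue
        weight = cosine(data[user], theirs)
        num += weight * theirs[item]
        den += weight
    return num / den if den else None


# Ann's tastes resemble Ben's more than Cara's, so the prediction for item D
# is pulled towards Ben's high rating.
print(predict(ratings, "ann", "D"))
```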

In the United States, hundreds of companies are utilizing open and big data (such
as weather and GPS data) as key resources to generate value across different sectors
including finance and investment, education, environment and weather, housing
and real estate, and food and agriculture (US Chamber of Commerce Foundation
2014). The next section elaborates on a number of data-driven transformations
across different sectors including telecommunication, healthcare, public sector,
finance and insurance, media and entertainment, energy, and transport.


9.3 Transformation in Sectors


The BIG Project examined how big data technologies can enable business innova- 
tion and transformation within different sectors by gathering big data requirements 
from vertical industrial sectors, including health, public sector, finance, insurance, 
telecom, media, entertainment, manufacturing, retail, energy, and transport. There 
are a number of challenges that need to be addressed before big data-driven 
innovation is generally adopted. Big data can only succeed in driving innovation 
if a business puts a well-defined data strategy in place before it starts collecting and 
processing information. Obviously, investment in technology requires a strategy to 
use it according to commercial expectations; otherwise, it is better to keep current 
systems and procedures. Organizations within many sectors are now beginning to 
take the time to understand where this strategy should take them. 

The full results of this analysis are available in Zillner et al. (2014). Part III of 
this book provides a concise summary of the key findings from a selected number of 
sectors. The remainder of this chapter provides an executive summary of the 
findings from each sector together with discussion and analysis. 


9.3.1 Healthcare 


Investigation of the healthcare sector in Chap. 10 revealed several developments,
such as escalating healthcare costs, increased need for healthcare coverage, and
shifts in provider reimbursement trends, which have triggered the demand for big
data technology. In the sector, the availability of and access to health data are
continuously improving, the required big data technologies (such as advanced data
integration and analytics technologies) are in place, and first-mover best-practice
applications have demonstrated the potential of big data technology. However,
the big data revolution in the healthcare domain is at a very early stage, with
most of the potential for value creation and business development unclaimed as well
as unexplored. Current roadblocks to big data-driven innovation are the established
incentives of the healthcare system, which hinder collaboration and, thus, data
sharing and exchange. The trend towards value-based healthcare delivery will
foster the collaboration of stakeholders to enhance the value of the patient’s
treatment, and thus will significantly increase the need for big data applications.


9.3.2 Public Sector 


The investigation of the public sector in Chap. 11 showed that the sector is facing 
some important challenges—the lack of productivity compared to other sectors, 
budgetary constraints, and other structural problems due to the aging population 
that will lead to an increasing demand for medical and social services, together with 
the foreseen lack of a young workforce in the future. 

The public sector is increasingly aware of the potential value to be gained from 
big data-driven innovation via improvements in effectiveness and efficiency and 
with new analytical tools. Governments generate and collect vast quantities of data 
through their everyday activities, such as managing pensions and allowance pay- 
ments, tax collection, etc. The main requirements, mostly non-technical, from the 
public sector are: 


(i) Interoperability: The fragmentation of data ownership and the resulting data 
silos are an obstacle to exploiting data assets. 

(ii) Legislative support and political willingness: The process of creating new 
legislation is often too slow to keep up with fast-moving technologies and 
business opportunities. 

(iii) Privacy and security issues: The aggregation of data across administrative 
boundaries in a non-request-based manner is a real challenge. 

(iv) Big data skills: Beyond the need for technical specialists, business-oriented 
staff lack knowledge of the potential of big data. 


9.3.3 Finance and Insurance 


As covered in Chap. 12, the finance and insurance sector is the clearest example of a 
data-driven industry. Big data represents a unique opportunity for most banking and 
financial services organizations to leverage their customer data to transform their 
business, realize new revenue opportunities, manage risk, and address customer 
loyalty. However, similarly to other emerging technologies, big data inevitably 
creates new challenges and data disruption for an industry already faced with 
governance, security, and regulatory requirements, as well as demands from the 
increasingly privacy-aware customer base. 

At this moment, not all finance companies are prepared to embrace big data, with 
legacy information infrastructure and organizational factors being the most 
significant barriers to its wide adoption in the sector. The deployment of big data 
solutions must be aligned with business objectives if adoption of the technology 
is to return the maximum business value. 


9.3.4 Energy and Transport 


Chapter 13 examines the sectors of energy and transport, which are very important 
for Europe from an infrastructure perspective as well as from resource efficiency 
and quality of life perspectives. The high quality of the physical infrastructure 
and the global competitiveness of the stakeholders need to be maintained with 
respect to the digital transformation and big data-driven innovation. 

The analysis of the available data sources in energy, and of their use cases in 
the different categories of big data value (operational efficiency, customer 
experience, and new business models), makes it clear that a mere utilization of 
existing big data technologies as employed by the online data businesses will not 
be sufficient. Domain- and device-specific adaptations are necessary for use in the 
cyber-physical systems of oil, gas, electricity, and transport. Innovation in 
privacy- and confidentiality-preserving data management and analysis is a primary 
concern of all energy and transport stakeholders that deal with customer data, be 
it business-to-consumer or business-to-business. Without satisfying the need for 
privacy and confidentiality, there will always be uncertainty around regulation 
and customer acceptance of new data-driven offerings. 

The increasing intelligence embedded in the infrastructures will enable the “in- 
field” analysis of the data to deliver “smart data”. This seems to be necessary, since 
the analytics involved will require much more elaborate algorithms than for other 
sectors such as retail. Additionally, the stakes are very high since the optimization 
opportunities are within critical infrastructures. 


9.3.5 Media and Entertainment 


The media and entertainment industries have frequently been at the forefront of 
adopting new technologies. Chapter 14 details the key business problems that are 
driving media companies to look at big data-driven innovation: the need to reduce 
the costs of operating in an increasingly competitive landscape and, at the same 
time, the need to increase revenue from delivering content. It is no longer 
sufficient to publish a newspaper or broadcast a television programme; contemporary 
operators must drive value from their assets at every stage of the data lifecycle. 

Media players are also more connected with their customers and competitors 
than ever before—thanks to the impact of disintermediation, content can be gener- 
ated, shared, curated, and republished by literally anyone. This means that the 
ability of big data technologies to ingest and process many different data sources, 
and if required even in real-time, is a valuable asset companies are prepared to 
invest in. 

As with the telecom industry, the legal and regulatory aspects of operating 
within Europe cannot be disregarded. As one example, the fact that it is technically 
possible to accumulate vast amounts of detail about customers from their service 
usage, call centre interactions, social media updates, and so on does not mean that 
it is ethical to do so without being transparent about how the data will be used. 
Europe has much stronger data protection rules than the United States, meaning 
that individual privacy and global competitiveness will need to be balanced. 


9.3.6 Telecommunication 


The telecom sector seems to be convinced of the potential of big data technologies. 
The combination of benefits within marketing and offer management, customer 
relationship, service deployment, and operations can be summarized as the 
achievement of operational excellence for telecom players. 

There are a number of emerging telecom-specific commercial big data platforms 
available in the market that provide dashboards and reports to assist 
decision-making processes and that can be integrated with business support systems 
(BSS). Automatic actuation on the network as a result of the analysis is yet to 
come. Besides these platforms, Data as a Service (DaaS) is a trend some operators 
are following, which consists of providing companies and public sector 
organizations with analytical insights that enable these third parties to become 
more effective. 

Another very important factor within the sector is related to policy. The 
Connected Continent framework, aimed at benefiting customers and fostering the 
creation of the infrastructure required for Europe to become a connected 
community, will, at first sight, most probably result in stricter regulations for 
telco players. A clear and stable framework is very important to foster investment 
in technology, including big data solutions. 


9.3.7 Retail 


The retail sector will be dependent on the collection of in-store data, product data, 
and customer data. To be successful in the future, retailers must have the ability to 
extract the right information out of huge data collections acquired in instrumented 
retail environments in real time. Existing business intelligence for retail analytics 
must be reorganized to understand customer behaviour and to be able to build more 
context-sensitive, consumer- and task-oriented recommendation tools for retailer- 
consumer dialog marketing. 


9.3.8 Manufacturing 


The core requirements in the manufacturing sector are the customization of 
products and production (“lot size one”), the integration of production in the 
larger product value chain, and the development of smart products. 

The manufacturing industry is undergoing radical changes with the introduction 
of information technology on a large scale. The developments under “Industry 4.0” 
include a growing number of sensors and growing connectivity in all aspects of 
the production process. Thus, data acquisition is concerned with making the already 
available data manageable, i.e., standardization and data integration are the 
biggest requirements. Data analysis is already applied in intra-mural applications 
and will be required for more integrated applications that cover complete logistics 
chains across factories in the production chain and even into the post-sale usage 
of (smart) products. Production planning needs to be supported by data-based 
simulation of these complete environments. 

Complex and smart machinery, e.g., airplane engines, can benefit from big data- 
based predictive maintenance where sensor and context information is used with 
machine learning algorithms to avoid unnecessary maintenance and to schedule 
protective repairs when failures are predicted. Given the additional infrastructure 
costs, manufacturers are using new business models where machinery is leased and 
not sold; and in turn sensor data and services are owned and executed by the 
manufacturer and not the user of machinery. This leads to challenges in regulations 
and contracts concerning data ownership. 
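
As a rough illustration of the predictive maintenance idea described above, the following Python sketch flags sensor readings that deviate strongly from their recent history. It deliberately substitutes a simple rolling z-score rule for the machine learning algorithms mentioned in the text. The function name `flag_anomalies`, the `window` and `threshold` parameters, and the vibration values are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev

# Hypothetical vibration readings from one machine sensor (arbitrary units).
VIBRATION = [1.0, 1.1, 0.9, 1.0, 1.1, 1.0, 0.9, 1.0, 2.5, 1.0]


def flag_anomalies(readings, window=5, threshold=3.0):
    """Return indices of readings that lie more than `threshold` standard
    deviations away from the mean of the preceding `window` readings.

    This is a simple statistical stand-in, not a machine learning model.
    """
    flagged = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history gives no scale for comparison; skip it.
            continue
        if abs(readings[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    print(flag_anomalies(VIBRATION))  # prints [8]
```

In this toy series only the reading at index 8 is flagged. In a real setting such a flag would be one input among many (context information, maintenance history) to the scheduling of repairs.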

The European manufacturing sector can be both a market leader using big data in 
the context of Industry 4.0, and a leading market, where manufacturing big data is 
integrated in the larger product value chain and smart products can be put to use. 


9.4 Discussion and Analysis 


The analysis of the key findings across the sectors indicates that it is important to 
distinguish the technical from the business perspective. From a technological 
perspective, big data applications represent an evolutionary step. Big data technol- 
ogies, such as decentralized networking and distributed computing for scalable data 
storage and scalable data analytics, semantic technologies and ontologies, machine 
learning, natural language processing, and other data mining techniques have been 
the focus of research projects for many years. Now these techniques are being 
combined and extended to address the technical challenge faced in the big data 
paradigm. 

When analysed from the business perspective, it becomes clear that big data 
applications have a revolutionary, sometimes even disruptive, impact on the 
existing industrial business-as-usual practices. Thought through to its conclusion, 
this means that new players emerge that are better suited to offer services based 
on mass data, and that the underlying business processes change fundamentally. 
For instance, in the healthcare domain, big data technologies can be used to 
produce new insight about the effectiveness of treatments, and this knowledge can 
be used to increase quality of care. However, in order to benefit from the value of 
these big data applications, the industry requires new reimbursement models that 
reward the quality instead of the quantity of treatments. Similar changes are 
required in the energy industry: energy usage data from end users would have 
benefits for multiple stakeholders such as energy retailers, distribution network 
operators, and new players such as demand response providers, aggregators, and 
energy efficiency service providers. But who is to invest in the technologies that 
would harvest the energy data in the first place? New participatory business value 
networks are required instead of static value chains. 

Within all industries, the 3 Vs of big data (volume, velocity, and variety) have 
been of relevance. In addition, industrial sectors that are already reviewing 
themselves in the light of the big data era add further Vs to reflect sector-specific 
aspects and to adapt the big data paradigm to their particular needs. Many of those 
extensions, such as data privacy, data quality, and data confidentiality, address 
the challenge of data governance, while other extensions, such as value, address the 
fact that the potential business value of big data applications is as yet unexplored 
and may not be well understood within the sector. 

Within all industrial sectors it became clear that it was not the availability of 
technology, but the lack of business cases and business models that is hindering the 
implementation of big data. Usually, a business case needs to be clearly defined and 
convincing before investment is made in new applications. However, in the context 
of big data applications, the development of a concrete business case is a very 
challenging task. This is due to two reasons. First, as the impact of big data 
applications relies on the aggregation of not just one but a large variety of 
heterogeneous data sources beyond organizational boundaries, the effective 
cooperation of multiple stakeholders with potentially diverging or at first 
orthogonal interests is required. Thus, the stakeholders’ individual interests and 
constraints, which in addition are quite often moving targets, need to be reflected 
within the business case. Second, existing approaches for developing business models 
and business cases usually focus on single organizations and do not provide guidance 
for dynamic value networks of multiple stakeholders within a digital single market. 


9.5 Conclusion and Recommendations 


Data-driven innovation has the potential to impact all sectors of the economy. 
However, to realize this potential, policymakers need to develop coherent policies 
for the use of data. This could be achieved by: (1) supporting education that focuses 
on data science skills, (2) removing the barriers to creating a digital single market, 
(3) stimulating the investment environment needed for big data technology, 
(4) making public data accessible through open data and removing data silos, 
(5) providing competitive technical infrastructure, and (6) promoting balanced 
legislation. At the same time, policy must address issues such as privacy and 
security, ownership and transfer, and infrastructure and data civics (Hemerly 2013). 
In this vein, there are calls for a Magna Carta for data to address questions on how 
big data technologies could facilitate discrimination and marginalization; how to 
ensure that contracts between individuals and powerful big data companies or 
governments are fair; and where to situate the responsibility for the security of 
data (Insight Centre for Data Analytics 2015). In our opinion, further and 
sustainable progress in big data-driven innovation is contingent on actions by 
governments, in collaboration with other major stakeholders, in developing the right 
policy and regulatory environment based on empirical evidence from systematic 
research around some of the questions advanced above. 


Chapter 10 
Big Data in the Health Sector 


Sonja Zillner and Sabrina Neururer 


10.1 Introduction 


Several developments in the healthcare sector, such as escalating healthcare costs, 
increased need for healthcare coverage, and shifts in provider reimbursement 
trends, trigger the demand for big data technologies in order to improve the overall 
efficiency and quality of care delivery. For instance, the McKinsey Company 
(2011) study indicates a high financial impact of big data applications in the 
healthcare domain, of the order of $300 billion in value per year for the US alone. 
Similarly impressive numbers are provided by IBM: in the Executive Report of 
IBM Global Business Services (Korster and Seider 2010), the authors describe the 
healthcare system as highly inefficient, estimating that approximately US$ 2.5 
trillion is wasted annually and that efficiency can be improved by 35 %. Compared 
with other industries, this is the largest opportunity for efficiency improvements. 
Moreover, major players are investing in the growth market of medicine for an aging 
population; for instance, Google founded a new company, Calico, to tackle age-related 
health problems. In conclusion, big data applications in healthcare have high 
future potential and opportunities. 

However, to the best of our knowledge, only a limited number of implemented 
big data-based application scenarios can be found today. Although non-advanced 
healthcare analytics applications, such as analytics for improved accounting, 
quality control, or clinical research, are widely available, these applications do 
not make use of the potential of big data technologies. This is mainly due to the 
fact that health data cannot be easily accessed. High investment and effort are 
needed to enable efficient health data management and seamless health data access 
as the foundation for big data applications. As a consequence, convincing business 
cases are difficult to identify, as the burden of the initial investment strongly 
reduces any profit expectations. In other words, one of the biggest challenges in 
the healthcare domain for the realization of big data applications is the fact 
that high investments, standards, and frameworks as well as new supporting 
technologies are needed in order to make health data available for subsequent big 
data analytics applications. Thus, the efficient management and integration of 
health data is a key requirement for big data applications in the healthcare domain 
that needs to be addressed. 

The investigations (Zillner et al. 2014a, b) in this chapter found that the highest 
impact of big data applications in the healthcare domain is expected when it 
becomes possible to rely not on a single data source but on various data sources, 
such that different aspects from the various domains can be related. Therefore, the 
availability and integration of all related health data sources, such as clinical data, 
claims, cost and administrative data, pharmaceutical and research data, patient 
monitoring data, as well as the health data on the web, is of high relevance. 

Health data is a form of “big data” not only because of its sheer volume but also 
because of its complexity, diversity, and timeliness. Although a large volume of 
structured data is already available today, the volume of unstructured data, such 
as biometric data, text reports, and medical images, will dominate the overall data 
volume requirements. This is closely related to the challenge of handling the high 
variety of health data: not only do very heterogeneous data, such as images, 
structured reports, and unstructured notes, require new forms of (pre-)processing, 
but the semantics of the various domains, such as financial, administrative, 
research, patient, or public health, also need to be reflected. The value of big 
data applications relies on the identification of convincing business cases. As the 
impact and success of healthcare business cases rely on the cooperation of multiple 
stakeholders with often diverging interests, such business cases are challenging to 
identify. 


10.2 Analysis of Industrial Needs in the Health Sector 


The interviews and investigation in this section show that the high-level require- 
ments of increased efficiency and quality of healthcare of today are often seen as 
opposing. The majority of high-quality health services rely on the analysis of larger 
amounts of data and content. This automatically leads to increased cost of care 
given that the means for automatic analysis of data, such as big data technologies, 
are still missing. However, with big data analytics, it becomes possible to segment 
the patients into groups and subsequently determine the differences between patient 
groups. Instead of asking the question “Is the treatment effective?”, it becomes 
possible to answer the question “For which patient is this treatment effective?” This 
shift from average-based towards individualized healthcare bears the potential to 
significantly improve the overall quality of care in an efficient manner. Conse- 
quently, any information that could help to improve both the quality and the 
efficiency of healthcare at the same time was indicated as most relevant and useful. 
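
The shift from the average-based question to the segment-based question can be sketched in a few lines of Python. The records, the segment labels, and the function names below are hypothetical and serve only to illustrate the idea; the comparison is a naive difference in recovery rates, not a causal analysis, and it is not taken from the investigations reported in this chapter.

```python
from collections import defaultdict

# Hypothetical toy records: (patient segment, treated?, recovered?)
RECORDS = [
    ("under_60", True, True),
    ("under_60", True, True),
    ("under_60", False, False),
    ("under_60", False, True),
    ("60_plus", True, False),
    ("60_plus", True, True),
    ("60_plus", False, True),
    ("60_plus", False, False),
]


def recovery_rates(records):
    """Recovery rate for every (segment, treated) combination."""
    counts = defaultdict(lambda: [0, 0])  # [recovered, total]
    for segment, treated, recovered in records:
        cell = counts[(segment, treated)]
        cell[0] += int(recovered)
        cell[1] += 1
    return {key: rec / total for key, (rec, total) in counts.items()}


def effect_by_segment(records):
    """Recovery rate of treated minus untreated patients, per segment."""
    rates = recovery_rates(records)
    segments = {segment for segment, _ in rates}
    return {
        s: rates[(s, True)] - rates[(s, False)]
        for s in segments
        if (s, True) in rates and (s, False) in rates
    }


def overall_effect(records):
    """The average-based view: pool all patients into one segment."""
    pooled = [("all", treated, recovered) for _, treated, recovered in records]
    return effect_by_segment(pooled)["all"]


if __name__ == "__main__":
    print(overall_effect(RECORDS))     # "Is the treatment effective?"
    print(effect_by_segment(RECORDS))  # "For which patient is it effective?"
```

On this toy data the pooled difference is 0.25, while the per-segment view shows a difference of 0.5 for the `under_60` segment and 0.0 for the `60_plus` segment, which is the kind of distinction the average-based question hides.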

High impact insights can only be realized if the data analytics is accomplished 
on heterogeneous datasets encompassing data from the clinical, administrative, 
financial, and public domains. This requires that the various stakeholders owning¹ 
the data are willing to share their data assets. However, there is strong competition 
between the involved stakeholders of the healthcare industry. It is a competition for 
resources, and the resources are limited. Each stakeholder is focused on their own 
financial interests, which often leads to sub-optimal treatment decisions. 
Consequently, the patient is currently the one who suffers most. The interests and 
roles of the various stakeholder groups can be summarized as follows: 


• Patients have an interest in affordable, high quality, and broad coverage of 
healthcare. As of today, only very limited data about the patient’s health 
conditions is available and patients have only very limited opportunities to 
actively engage in the process. 

• Hospital operators are trying to optimize their income from medical 
treatments, i.e. they have a strong interest in improved efficiency of care, such as 
automated accounting routines, improved processes, or improved utilization of 
resources. 

• Clinicians and physicians are interested in more automated and less 
labour-intensive routine processes, such as coding tasks, in order to have more time 
available for and with the patient. In addition, they are interested in accessing 
aggregated, analysed, and concisely presented health data that enables informed 
decision-making and high quality treatment decisions. 

• Payors, such as governmental or private healthcare insurers: As of today, the 
majority of current reimbursement systems manage fee-for-service or 
Diagnosis-Related Group (DRG) based payments using simple IT-negotiation and data 
exchange processes between payors and healthcare providers and do not rely 
on data analytics. As payors decide which health services (i.e. which 
treatment, which diagnosis, or which preventative test) will be covered, 
their position and influence regarding the adoption of innovative treatments and 
practices is quite powerful. However, currently only limited and fragmented data 
about the effectiveness and value of health services is available; the reasons for 
treatment coverage often remain unclear and sometimes seem to be arbitrary. 

• Pharmaceuticals, life science, biotechnology, and clinical research: Here the 
discovery of new knowledge is the main interest and focus. As of today, the 
various mentioned domains are mainly unconnected and accomplish their data 
analytics on single data sources. By integrating heterogeneous and distributed 
data sources, the impact of data analytic solutions is expected to increase 
significantly in the future. 

• Medical product providers are interested in accessing and analysing clinical 
data in order to learn about their own products’ performance in comparison to 
competitors’ products in order to increase revenue and/or improve their own 
market position. 


¹ The concept of data ownership influences how and by whom the data can be used. Thus, the term 
“ownership of data” refers to both the possession of and responsibility for information, that is, 
the term “ownership of data” implies power as well as control. 


To transform the current healthcare system into a preventative, pro-active, and 
value-based system, the seamless exchange and sharing of health data is needed. 
This again requires effective cooperation between stakeholders. However, today 
the healthcare setting is mainly determined by incentives that hinder cooperation. 
To foster the implementation and adoption of comprehensive big data applications 
in the healthcare sector, the underlying incentives and regulations defining the 
conditions and constraints under which the various stakeholders interact and coop- 
erate need to be changed. 


10.3 Potential Big Data Applications for Health 


Analysis of the health sector (Zillner et al. 2014b) shows that several big data 
application scenarios exist that aim towards aligning the need of improved quality, 
which in general implies increased cost of care, with the need of improved effi- 
ciency of care. Common to all identified big data applications is the fact that they all 
require a means to semantically describe and align various heterogeneous data 
sources, means to ensure high data quality, means that address data privacy and 
security, as well as means for data analytics on integrated datasets. 

For example, Public Health Analytics applications demonstrate the potential 
opportunities of big data technologies as well as the associated technical 
requirements. Public health applications rely on the management of 
comprehensive and longitudinal health data on chronic (e.g. diabetes, congestive 
heart failure) or severe (e.g. cancer) diseases from the specific patient population 
in order to aggregate and analyse treatment and outcome data. The insights gained 
are very valuable as they help to reduce complications, slow disease progression, 
and improve treatment outcomes. For instance, since 1970 Sweden has been 
continuously investing in public health analytic initiatives, leading to 90 
registries that today cover 90 % of all Swedish patient data with selected 
characteristics (some even cover longitudinal data) (Soderland et al. 2012). A 
related study (PricewaterhouseCoopers 2009) showed that Sweden has the best 
healthcare outcomes in Europe at average healthcare costs (9 % of the gross domestic 
product (GDP)). In order to achieve this, health data, which is stored as structured 
data (e.g. lab reports) as well as unstructured data (e.g. medical reports, medical 
images), needs to be semantically enriched (Semantic Data Enrichment) in order to 
make the implicit semantics of health data understandable across the involved 
organizations and stakeholders. In addition, a common infrastructure with common 
standards is needed that allows for seamless data sharing (Data Sharing) as well as 
for the physical integration of multiple data sources into one platform (Data 
Integration). In order to comply with the high data security and privacy 
requirements that are needed to protect the sensitive nature of longitudinal health 
data, common legal frameworks as well as technical means for data anonymization need 
to be in place (Data Security and Privacy). Moreover, in order to ensure the 
comparability of health datasets, processes ensuring high data quality through the 
standardized documentation as well as systematic analysis of health and outcome data 
of the specific patient population are required (Data Quality). 
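
To make the notion of "technical means for data anonymization" more tangible, the following Python sketch checks a small dataset against k-anonymity, one common anonymization criterion. The chapter does not name a specific technique, so the choice of k-anonymity is an assumption made here; the patient records, the field names, and both functions are likewise hypothetical.

```python
from collections import Counter

# Hypothetical patient records; "age" and "postcode" act as quasi-identifiers.
PATIENTS = [
    {"age": 34, "postcode": "111", "diagnosis": "diabetes"},
    {"age": 37, "postcode": "111", "diagnosis": "cancer"},
    {"age": 62, "postcode": "222", "diagnosis": "diabetes"},
    {"age": 68, "postcode": "222", "diagnosis": "heart failure"},
]


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same
    quasi-identifier values. A dataset is k-anonymous with respect to
    these fields if this value is at least k."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values()) if groups else 0


def generalize_age(record, bucket=10):
    """Replace the exact age with an age range of width `bucket`."""
    low = (record["age"] // bucket) * bucket
    return {**record, "age": f"{low}-{low + bucket - 1}"}


if __name__ == "__main__":
    fields = ["age", "postcode"]
    print(smallest_group_size(PATIENTS, fields))  # every record is unique
    generalized = [generalize_age(p) for p in PATIENTS]
    print(smallest_group_size(generalized, fields))
```

With exact ages every record forms its own group (smallest group size 1), so each patient could be singled out. After generalizing ages into ten-year ranges, each combination of age range and postcode covers two records. Generalization of this kind trades analytical detail for privacy, which is one reason legal frameworks and technical means need to be considered together.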

In terms of data handling, the other identified application scenarios yield very 
similar technical requirements. For instance, Comparative Effectiveness Research 
applications aim to compare the clinical and financial effectiveness of interventions 
in order to increase the efficiency and quality of clinical care services. To achieve 
this, large datasets encompassing clinical data (information about patient charac- 
teristics), financial data (cost data), and administrative data (treatments and services 
accomplished) are critically analysed in order to identify the clinically most effec- 
tive, as well as most cost-effective treatments that work best for particular patients. 

Clinical Operation Intelligence applications aim to identify waste in clinical 
processes in order to optimize them accordingly. By analysing medical procedures, 
performance opportunities, such as improved clinical processes, fine-tuning, and 
adaptation of clinical guidelines, can be realized. Other examples are Clinical 
Decision Support (CDS) applications seeking to enhance the efficiency and quality 
of care operations by assisting clinicians and healthcare professionals in their 
decision-making process by enabling context-dependent information access, by 
providing pre-diagnostic information or by validating and correcting the data 
provided. A further category of scenarios comprises applications addressing the Secondary 
Usage of Health Data that rely on the aggregation, analysis, and concise presenta- 
tion of clinical, financial, administrative, as well as other related health data in order 
to discover new valuable knowledge, for instance, to identify trends, predict out- 
comes, or to influence patient care, drug development, and therapy choices. Finally, 
Patient Engagement Applications focus on establishing a platform/patient portal 
that fosters active patient engagement in healthcare processes. Any health apps that 
run on top of the patient platform rely on the integration of episodic health data 
from clinical settings as well as non-episodic data captured by devices to monitor 
health-related parameters, such as activity, diet, sleep, or weight. 


10.4 Drivers and Constraints for Big Data in Health 


The successful realization of big data in health has several drivers and constraints. 


10.4.1 Drivers 


The following drivers were identified for big data in the health sector: 


• Increased volume of electronic health data: With the increasing adoption of 
electronic health record (EHR) technology (which is already the case in the 
USA), and the technological progress in the area of next generation sequencing 
and medical image segmentation, more and more health data will be available. 

• Need for improved operational efficiency: To address greater patient volumes 
(aging population) and to reduce very high healthcare expenses, transparency of 
the operational efficiency is needed. 

• Value-based healthcare delivery: Value-based healthcare relies on the 
alignment of treatment and financial success. In order to gain insights about the 
correlation between effectiveness and cost of treatments, data analytics solutions 
on integrated, heterogeneous, complex, and large sets of healthcare data are 
demanded. 

• US legislation: The US Healthcare Reform, also known as Obamacare, fosters 
the implementation of EHR technologies as well as health data analytics. These 
have a significant impact on the international market for big health data 
applications. 

• Increased patient engagement: Applications such as “PatientsLikeMe”² 
demonstrate the willingness of patients to actively engage in the healthcare process. 

• New incentives: The current system incentives enforce “high number” instead 
of “high quality” of treatments. Although it is obvious that nobody wants to pay 
for treatments that are ineffective, this is still the case in many medical systems. 
In order to avoid reimbursing low-quality treatments, the incentives of the medical 
systems need to be aligned with outcomes. Several initiatives, such as Accountable 
Care Organizations (ACO) (Centers for Medicare and Medicaid Services 
2010), or Diagnosis-Related Groups (DRG) (Ma Ching-To Albert 1994), have 
been implemented in order to reward quality instead of quantity of treatments. 


10.4.2 Constraints 


The constraints for big data in the health sector can be summarized as follows: 


• Digitalization of health data: Only a small percentage of health-related data
is available in digital format.

• Lack of standardized health data: The seamless sharing of data requires that
health data across hospitals and patients be captured in a unified,
standardized way.


2 http://www.patientslikeme.com/
• Data silos: Healthcare data is often stored in distributed data silos, which
makes data analytics cumbersome and unstable.

• Organizational silos: Due to missing incentives, cooperation across different
organizations, and sometimes even between departments within one organization,
is rare.

• Data security and privacy: Legal frameworks defining data access, security,
and privacy issues and strategies are missing, which hinders the sharing and
exchange of data.

• High investments: The majority of big data applications in the healthcare
sector rely on the availability of large-scale, high-quality, and longitudinal
healthcare data. The collection and maintenance of such comprehensive data
sources requires not only high investments but also time (years) until the
datasets are comprehensive enough to produce good analytical results.

• Missing business cases and unclear business models: Any innovative technology
that is not aligned with a concrete business case, including associated
responsibilities, is likely to fail. This is also true for big data solutions.
Hence, the successful implementation of big data solutions requires
transparency about (a) who is paying for the solution, (b) who is benefiting
from the solution, and (c) who is driving the solution. For instance, the
implementation of data analytics solutions using clinical data requires high
investments and resources to collect and store patient data, e.g. by means of
an electronic health record (EHR) solution. Although it seems obvious how the
stakeholders involved could benefit from the aggregated datasets, it remains
unclear whether they would be willing to pay for, or drive, such an
implementation.


10.5 Available Health Data Resources 


The healthcare system has several major pools of health data that are held by 
different stakeholders/parties: 


• Clinical data, which is owned by the provider (such as hospitals, care
centres, physicians, etc.) and encompasses any information stored within the
classical hospital information systems or EHR, such as medical records, medical
images, lab results, genetic data, etc.

• Claims, cost, and administrative data, which is owned by the provider and the
payors and encompasses any datasets relevant for reimbursement issues, such as
utilization of care, cost estimates, claims, etc.

• Research data, which is owned by the pharmaceutical companies, research labs/
academia, and government and encompasses clinical trials, clinical studies,
population and disease data, etc.

• Patient monitoring data, which is owned by patients or monitoring device
producers and encompasses any information related to patient behaviours and
preferences.

• Health data on the web: websites such as “PatientsLikeMe” are getting more
and more popular. By voluntarily sharing data about rare diseases or remarkable
experiences with common diseases, their communities and users are generating
large sets of health data with valuable content.


The improvement of quality of care can be addressed if the various dimensions 
of health data are incorporated in the automated health data analysis. The data 
dimensions encompass (a) the clinical data describing the health status and history 
of a patient, (b) the administrative and clinical process data, (c) the knowledge 
about diseases as well as related (analysed) population data, and (d) the knowledge 
about changes. If the data analysis is restricted to only one data dimension, for 
example, the administrative and financial data, it will be possible to improve the 
already established management and reimbursement processes; however it will not 
be possible to identify new standards for individualized treatments. Hence, the 
highest clinical impact of big data approaches for the healthcare domain can be 
achieved if data from the four dimensions are aggregated, compared, and related. 

As each data pool is held by different stakeholders/parties, the data in the health 
domain is highly fragmented. However, the integration of the various heteroge- 
neous datasets is an important prerequisite of big health data applications and 
requires the effective involvement and interplay of the various stakeholders. There- 
fore, adequate system incentives, which support the seamless sharing and exchange 
of health data, are needed. 


10.6 Health Sector Requirements 


The Healthcare Sectorial Forum identified several requirements that need to be
addressed by big data applications in the healthcare domain. The following
sections distinguish between non-technical and technical requirements.


10.6.1 Non-technical Requirements 


Business-related requirements are called non-technical requirements. They
embrace important prerequisites and needs for big health data applications,
such as the need for high investments, value-based system incentives, or
multi-stakeholder business cases.


Need for High Investments Due to the large-scale nature of big health data, the
development and maintenance of big data applications in the healthcare domain,
as well as of the datasets themselves, require high investments. Big health
data applications mainly rely on large-scale, high-quality, and often
longitudinal healthcare data, which require several years of data gathering to
establish comprehensive sets of data that can be analysed to produce accurate
and insightful results. Such high investments can rarely be borne by a single
party; they need to engage multiple stakeholders, which leads directly to the
next non-technical requirement.


Multi-stakeholder Business Cases Due to the high investment needs described 
above, it is often essential that several different stakeholders cooperate in order to 
cover the investment costs. Here the interests of the stakeholders often diverge. 
Another important issue is that the main beneficiaries of a solution are often not the 
ones that are able or willing to finance a complete solution (e.g. patients). Never- 
theless, even though it is often apparent how involved stakeholders could benefit 
from a certain big data solution with aggregated datasets of high quality, it often 
remains unclear whether those stakeholders are able or willing to drive or pay for 
such a solution. 


Need for Value-Based System Incentives In order to increase the effectiveness 
of medical treatments, it is necessary to avoid low-quality reimbursements. This 
means that the current situation of high-number treatments instead of high-quality 
treatments needs to be improved. Since nobody wishes to pay for ineffective 
treatments, the incentives of health systems need to be well aligned with outcomes 
(e.g. performance-based financing and reimbursement systems) and, in addition, the 
cooperation between stakeholders needs to be rewarded. 


10.6.2 Technical Requirements 


Technical requirements are requirements that are related to specific technologies. 
They include semantic data enrichment, data integration and sharing, data privacy 
and security, as well as data quality. A major prerequisite for big data applications 
and analytics is the availability of data in an appropriate digital form. Many 
appropriate technologies are available to fulfil and support this requirement 
(e.g. speech recognition). Therefore no emphasis is put on data digitalization. The 
lack of appropriate digital data in healthcare is mostly caused by the limited 
adoption of data digitalization approaches in the everyday routine and familiar 
workflows of clinicians. 


Semantic Data Enrichment According to estimates by the IDC market research
institute, approximately 90 % of health data will be available only in
unstructured form in the coming years (Lünendonk GmbH 2013). To facilitate and
guarantee seamless processing of such data, semantic data enrichment is needed.
This means that health data, such as medical reports, images, videos, or
communications, need to be enriched by so-called semantic labels. The major
challenge with semantic data enrichment is that technological progress needs to
be achieved in the analysis of several different types of data.


Data Integration and Sharing In order to avoid data silos or data cemeteries,
big data has to be efficiently integrated from various different data sources
and shared seamlessly. Currently, the adoption of technology to exchange data is
lagging behind in Europe (Accenture 2012). In the United Kingdom less than 46 %
of healthcare providers perform healthcare information exchanges, and in Germany
and France this rate is even lower (approximately 25 %). This requirement goes
hand in hand with the need for structured or semantically enriched data in order
to make data easily accessible. A major prerequisite for medical research is the
possibility to integrate data from various different sources to obtain a
longitudinal view of a patient’s history.


Data Security and Privacy When processing, integrating, or sharing medical
data, a strong emphasis must be put on data security and privacy. Medical data
is categorized as highly sensitive personal data, and therefore protection from
unauthorized access, manipulation, or damage has to be guaranteed. Because the
nature of big data might bypass established privacy protection approaches
(e.g. when aggregating big data from different data sources), big health data
applications need to focus even more strongly on data privacy and security. For
instance, anonymization is a popular approach to de-identify health-related
personal data, yet by aggregating big data from various different data sources,
anonymized data could be unintentionally re-identified. Therefore, existing
privacy enhancing methods need to be evaluated to find out whether they can
meet all privacy requirements even when dealing with big data. If data privacy
cannot be guaranteed by a specific method, this method needs to be adapted in
order to satisfy the need for privacy, or new methods and approaches need to be
developed. Apart from the technical challenges, a common international legal
framework together with guidelines needs to be established in order to provide
a common basis for international exchange and integration of health-related big
data.
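
The re-identification risk mentioned above can be illustrated with a small,
entirely synthetic sketch. The records, the field names, and the choice of
quasi-identifiers (postcode, birth year, sex) are assumptions made purely for
illustration; the function link is a hypothetical helper, not an established
method from the cited literature.

```python
from collections import defaultdict

# Synthetic "anonymized" clinical records: direct identifiers are removed,
# but quasi-identifiers (postcode, birth year, sex) are still present.
clinical = [
    {"postcode": "80331", "birth_year": 1954, "sex": "F", "diagnosis": "diabetes"},
    {"postcode": "80331", "birth_year": 1954, "sex": "M", "diagnosis": "asthma"},
    {"postcode": "10115", "birth_year": 1987, "sex": "F", "diagnosis": "migraine"},
]

# Synthetic second source (e.g. a public register) that still carries names.
register = [
    {"name": "Person A", "postcode": "80331", "birth_year": 1954, "sex": "F"},
    {"name": "Person B", "postcode": "10115", "birth_year": 1987, "sex": "F"},
    {"name": "Person C", "postcode": "10115", "birth_year": 1987, "sex": "F"},
]

QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")


def key(record):
    """Project a record onto its quasi-identifier values."""
    return tuple(record[q] for q in QUASI_IDENTIFIERS)


def link(clinical_rows, register_rows):
    """Return (name, diagnosis) pairs for clinical rows that match exactly
    one named register entry on all quasi-identifiers."""
    names_by_key = defaultdict(list)
    for row in register_rows:
        names_by_key[key(row)].append(row["name"])

    reidentified = []
    for row in clinical_rows:
        candidates = names_by_key.get(key(row), [])
        if len(candidates) == 1:  # unique match: the record is re-identified
            reidentified.append((candidates[0], row["diagnosis"]))
    return reidentified


print(link(clinical, register))  # [('Person A', 'diabetes')]
```

In this toy case one of three "anonymized" records is re-identified as soon as
a second source is joined in, while the record that shares its
quasi-identifiers with two register entries stays ambiguous.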


Data Quality High quality of available datasets is a major prerequisite for big 
data applications in the healthcare domain. The benefit of an application is strongly 
correlated with the quality of the data. In the healthcare domain, the quality of the 
available data is often unclear. The frequency of missing or incorrect values is an 
indicator of data quality. Usually the quality of data improves when data is captured 
and processed using high-quality information technology (IT) tools. Such tools can 
be integrated into everyday work routines and perform certain data quality checks 
(e.g. plausibility checks) during the data capturing or entering process. In order to 
generate valuable results or decision support when analysing health data, big data 
applications need to fulfil high quality standards. 
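
As a minimal sketch of the two indicators named above (frequency of missing
values and plausibility checks at data entry), the following example computes
both for a handful of synthetic records. The field names and the plausibility
ranges are hypothetical; real limits would have to be defined by clinical
experts.

```python
# Hypothetical plausibility ranges, for illustration only.
PLAUSIBLE_RANGES = {
    "heart_rate_bpm": (20, 250),
    "body_temp_c": (30.0, 45.0),
}


def quality_report(records, fields=PLAUSIBLE_RANGES):
    """Return, per field, the share of missing and of implausible values."""
    report = {}
    total = len(records)
    for field, (low, high) in fields.items():
        missing = sum(1 for r in records if r.get(field) is None)
        implausible = sum(
            1
            for r in records
            if r.get(field) is not None and not (low <= r[field] <= high)
        )
        report[field] = {
            "missing_rate": missing / total if total else 0.0,
            "implausible_rate": implausible / total if total else 0.0,
        }
    return report


# Synthetic example records.
records = [
    {"heart_rate_bpm": 72, "body_temp_c": 36.8},
    {"heart_rate_bpm": None, "body_temp_c": 37.1},
    {"heart_rate_bpm": 720, "body_temp_c": 36.5},  # likely a typing error
    {"heart_rate_bpm": 65},                        # temperature never recorded
]

report = quality_report(records)
print(report)
```

A check of this kind could run while data is being entered, so that an
implausible value is flagged to the clinician immediately instead of being
discovered during later analysis.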


10.7 Technology Roadmap for Big Data in the Health Sector


The following roadmap outlines and describes technologies and the underlying 
research questions, which meet the requirements defined in the previous section. 
Figure 10.1 visualizes and aligns them with the specific technical requirements. 


10.7.1 Semantic Data Enrichment 


In order to semantically enrich medical data a framework needs to be provided. 
Therefore, semantic enrichment techniques are needed that go beyond the mere 
extraction of relevant information from unstructured text or medical images. 
Semantic labels, which express and define the meaning of information, render the 
original content semantically accessible as well as automatically processable and 
machine-readable. For instance, medical procedure and diagnosis entities in 
unstructured text such as medical reports are recognized and the describing pas- 
sages are linked. Therefore sophisticated text analysis techniques are needed 
(Bretschneider et al. 2013). Furthermore, a standardized enrichment framework, 
which is supporting the technical integration, is needed.


Technical Requirement | Technology | Research Question
Semantic Data Enrichment | Medical IE Algorithm | Identification of Relevant Information Entities
Semantic Data Enrichment | Medical Image Understanding | Automated detection of abnormal structures
Semantic Data Enrichment | Medical Annotation Framework | Standards fostering IE algorithm integration
Data Sharing and Integration | Semantic Data Representation | Creation of mature data models
Data Sharing and Integration | Semantic Knowledge Models | Improvement of existing biomedical ontologies
Data Sharing and Integration | Context Representation | Provenance, data usage, licence
Data Privacy and Security | Hash algorithms | Hash algorithms
Data Privacy and Security | Secure Data Exchange | IHE profiles
Data Privacy and Security | De-identification Algorithms | Anonymization, Pseudonymization, k-Anonymity
Data Quality | Provenance Management | Trust & permission management mechanism
Data Quality | Human-Data Interaction | Natural language UI & schema agnostic queries
Data Quality | Unstructured Data Integration | Unstructured Data Integration

Fig. 10.1 Mapping requirements to research questions in the healthcare sector


To facilitate and improve semantic enrichment of medical data, advances are
needed for the following technologies:


• Information extraction from medical texts poses new challenges for classical
information extraction techniques, as negation, temporality, and further
contextual features need to be taken into account. Several studies (Fan and
Friedman 2011; Savova et al. 2010) show advances towards the special needs of
parsing medical text. As the ongoing research mainly focuses on clinical text in
the English language, adaptations to other European languages are needed.

• Image understanding algorithms that formally capture automatically detected
image information, such as anatomical structures, abnormal structures, and
semantic image annotations, are desired. Therefore, additional research that
takes into account the complexity of the human body as well as the different
medical imaging technologies is needed.

• Standardized medical annotation frameworks that include standardized medical
text processing and support the technical integration of annotation
technologies are needed. Even though some frameworks are available
(e.g. UIMA³), adaptations are needed in order to meet the specific challenges
and requirements of the healthcare domain.
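
Why negation matters for medical information extraction can be shown with a
deliberately naive, rule-based sketch. The mini-lexicon, the negation cues, and
the window size are all hypothetical choices made for illustration; this is not
the method of the studies cited above, and real clinical text would need far
more robust handling (temporality, scope, other languages).

```python
import re

# Hypothetical mini-lexicon and negation cues, for illustration only.
FINDINGS = ("pneumonia", "fracture", "effusion")
NEGATION_CUES = ("no", "without", "denies")
WINDOW = 4  # number of preceding tokens searched for a negation cue


def extract_findings(report):
    """Return (finding, negated) pairs for each lexicon hit in the report."""
    results = []
    # Treat '.' and ';' as sentence boundaries so cues do not leak across them.
    for sentence in re.split(r"[.;]\s*", report.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        for i, token in enumerate(tokens):
            if token in FINDINGS:
                preceding = tokens[max(0, i - WINDOW):i]
                negated = any(cue in preceding for cue in NEGATION_CUES)
                results.append((token, negated))
    return results


text = "No evidence of pneumonia. Small pleural effusion on the left."
print(extract_findings(text))  # [('pneumonia', True), ('effusion', False)]
```

A plain keyword search would report both findings as present; only the
contextual check separates the ruled-out pneumonia from the observed effusion.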


10.7.2 Data Sharing and Integration 


Efficient data integration and seamless sharing rely on standardized coding
schemes and terminologies as well as data models. Currently, standardized coding
systems are either used only for high-level information coding (e.g. diseases,
laboratory values, medications) or not used internationally. A lot of
information is not available in coded format at all. As for standardized data
models, the HL7 Reference Information Model⁴ (RIM) is expected to become the
standard data model for EHR implementations. Nevertheless, a high percentage of
technology providers still rely on their own data models when it comes to data
integration. In order to advance data integration and sharing, coding schemes as
well as data models need to be improved and standardized.


3 http://uima.apache.org/
4 http://www.hl7.org/implement/standards/rim.cfm


• Semantic data models enable the unambiguous representation of data. Existing
models (e.g. HL7 RIM) have several issues that make them difficult to
implement. Further research activities are ongoing, such as the Model for
Clinical Information (MCI) (Oberkampf et al. 2013), which integrates patient
models on the basis of ontologies.

• Semantic knowledge models such as biomedical domain ontologies and
terminologies are used in combination with semantic data models and help to
facilitate semantic interoperability. Several different models are available
(e.g. SNOMED CT⁵), but further research is needed in order to improve these
standards as well as to develop new ones.

• Context information is needed in order to provide information about data
provenance, usage, or ownership. Therefore standards for describing context
information are needed.
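
How a shared terminology and attached context information work together can be
sketched as follows. Everything here is hypothetical: the site names, the local
codes, the licence label, and in particular the target codes, which are
placeholders and not real SNOMED CT identifiers.

```python
from dataclasses import dataclass

# Hypothetical mapping from site-specific codes to a shared terminology.
# The target codes are placeholders, not real SNOMED CT identifiers.
LOCAL_TO_SHARED = {
    ("hospital_a", "DM2"): "SHARED:0001",
    ("hospital_b", "diab-t2"): "SHARED:0001",
    ("hospital_a", "HTN"): "SHARED:0002",
}


@dataclass(frozen=True)
class CodedFact:
    patient_id: str
    shared_code: str
    source: str   # context: which institution supplied the fact (provenance)
    licence: str  # context: usage terms attached to the data


def normalize(patient_id, source, local_code, licence="research-only"):
    """Translate a local code into the shared code, keeping its context."""
    shared = LOCAL_TO_SHARED.get((source, local_code))
    if shared is None:
        raise KeyError(f"no mapping for {local_code!r} from {source!r}")
    return CodedFact(patient_id, shared, source, licence)


a = normalize("p1", "hospital_a", "DM2")
b = normalize("p1", "hospital_b", "diab-t2")
print(a.shared_code == b.shared_code)  # True
```

Once both facts carry the same shared code they can be compared and aggregated,
while the source and licence fields preserve where each fact came from and how
it may be used.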


10.7.3 Data Privacy and Security 


In order to fulfil the high demand for big health data privacy and security,
different aspects need to be taken into account. Besides the national data
protection laws, a common legal framework for the European Union is needed in
order to facilitate international approaches or cooperation. In the context of
big health data privacy and security, it is often necessary to re-identify
patients (e.g. for longitudinally assessing a patient’s health status). The
aggregation of data from various different data sources brings up two major
challenges for big data privacy and security. First, the aggregation of data
from heterogeneous data sources is difficult, and the data for each patient has
to be aligned properly. Second, the nature of big data may bypass certain
privacy enhancing methods when data from various different data sources is
aggregated. Therefore advances are needed for the following technologies:


• Hash algorithms are widely used in cryptographic applications. Their one-way
property can also be used to generate pseudo-identifiers and thereby facilitate
secure pseudonymization. However, it is crucial that the hash algorithms used
are robust and collision resistant.

• Secure data exchange across institutional and country boundaries is essential
for several interesting visions for the healthcare domain (e.g. international
EHR). For this purpose, Integrating the Healthcare Enterprise (IHE)⁶ profiles
are widely used (e.g. IHE cross-enterprise document sharing), although they are
still the focus of research activities.

• De-identification algorithms, such as anonymization or pseudonymization, need
to be improved in order to guarantee data privacy even when aggregating big data
from different data sources. K-anonymity (El Emam and Dankar 2008) is a
promising approach that aims to ensure anonymity even in the big data context.
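
Two of the techniques listed above, hash-based pseudo-identifiers and
k-anonymity, can be sketched in a few lines using only the Python standard
library. This is an illustration under stated assumptions, not a vetted
privacy solution: the sketch uses a keyed hash (HMAC-SHA256) rather than a
plain hash, because unkeyed hashes of guessable identifiers can be reversed by
enumeration; the key, the patient identifiers, and the quasi-identifier columns
are hypothetical.

```python
import hashlib
import hmac
from collections import Counter


def pseudonym(patient_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudo-identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group of rows that share the same
    combination of quasi-identifier values (0 for an empty dataset)."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


key = b"demo-key-only"  # hypothetical key; a real key must be kept secret

rows = [
    {"pid": pseudonym("patient-001", key), "age_band": "50-59", "region": "South"},
    {"pid": pseudonym("patient-002", key), "age_band": "50-59", "region": "South"},
    {"pid": pseudonym("patient-003", key), "age_band": "20-29", "region": "North"},
]

print(k_anonymity(rows, ("age_band", "region")))  # 1
```

Because the same input and key always yield the same pseudo-identifier, records
of one patient can still be linked longitudinally. The result k = 1 shows that
pseudonymization alone does not make the dataset anonymous: the third record is
unique on its quasi-identifiers and would need further generalization or
suppression.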


5 http://ihtsdo.org/snomed-ct/ 
6 http://www.ihe.net/


10.7.4 Data Quality


Good data quality is a key enabler for big health data applications. It depends
on four different aspects: (1) the data quality of the original data sources,
(2) the coverage and level of detail of the collected data, (3) common
semantics as described before, and (4) the handling of media disruptions. In
order to improve data quality with respect to these four aspects, advances for
the following technologies are needed:


• An improvement of provenance management is needed in order to allow reliable
curation of health data. Therefore data-level trust and permission management
mechanisms need to be implemented.

• Human-data interaction technologies [e.g. natural language interfaces,
schema-agnostic query formulation (Freitas and Curry 2014)] improve data
quality, as they facilitate easy-to-use interaction that is well integrated
into particular workflows.

• Reliable information extraction approaches are needed in order to facilitate
the processing of unstructured medical data (e.g. medical reports, medical
images). Therefore existing approaches (e.g. natural language processing) have
to be improved in order to address the specific characteristics of health
information and data.


Roadmaps are usually developed for a single company. A roadmap for the European
market, by contrast, depends on (a) the degree to which the non-technical
requirements will be addressed and (b) the extent to which European
organizations are willing to invest in big data developments and use case
implementations. As such, it was not possible to produce an exact timeline of
technology milestones; Table 10.1 depicts an estimated timeline instead.


10.8 Conclusion and Recommendations for Health Sector 


Big data technologies and health data analytics provide the means to address the 
efficiency and quality challenges in the health domain. For instance, by aggregating 
and analysing health data from disparate sources, such as clinical, financial, and 
administrative data, the outcome of treatments in relation to the resource utilization 
can be monitored. This analysis in turn helps to improve the efficiency of care. 
Moreover, the identification of high-risk patients with predictive models leads 
towards proactive patient care allowing for the delivery of high quality care. 

A comprehensive analysis of domain needs and requirements indicated that the
highest impact of big data applications in the healthcare domain is achievable
when data can be acquired not only from one single source but from various data
sources, such that different aspects can be combined to gain new insights.
Therefore, the availability and integration of all related health data sources,
such as clinical data, claims, cost, and administrative data, pharmaceutical
and R&D data, patient behaviour and sentiment data, as well as the health data
on the web, is of high relevance.


Table 10.1 Timeframe of the major expected outcomes for the health sector

Technical requirement | Expected outcomes, in the order given across Years 1 to 5
Data enrichment | Standardized formats and interfaces for annotation modules; knowledge-based information extraction algorithm; algorithm for anomaly detection in images; data enrichment technologies available for a large number of different text types and multiple languages; definition and implementation of medical annotation framework
Data integration | Context representation for data repositories; aligned semantic knowledge models and terminologies; common semantic data model for structured patient data; common semantic data model for unstructured patient data; context representation for all patient data
Data security and privacy | IHE profiles for secure data exchange; privacy enhancing through hash algorithms; anonymization, pseudonymization, and k-anonymity approaches for big data
Data quality | Methods for trust and permission management; natural language UI and schema agnostic queries; integrated workflows for trust and permission management; context-aware integration of unstructured data

However, access to health data is currently only possible in a very constrained
manner. In order to enable seamless access to healthcare data, several
technical requirements need to be addressed: (1) the content of unstructured
health data (such as images or reports) must be enhanced by semantic
annotation; (2) data silos must be overcome by means of efficient technologies
for semantic data sharing and exchange; (3) technical means backed by legal
frameworks must ensure the regulated sharing and exchange of health data; and
(4) techniques for assessing and improving data quality must be available.

The availability of the technologies will not be sufficient for fostering wide- 
spread adoption of big data in the healthcare domain. The critical stumbling block is 
the lack of business cases and business models. As big data fosters a new dimension 
of value proposition in healthcare delivery, with insights on the effectiveness of 
treatments to significantly improve the quality of care, new reimbursement models 
that reward quality instead of quantity of treatments are needed. 


Chapter 11
Big Data in the Public Sector 


Ricard Munné 


11.1 Introduction 


The public sector is becoming increasingly aware of the potential value to be gained 
from big data. Governments generate and collect vast quantities of data through 
their everyday activities, such as managing pensions and allowance payments, tax 
collection, national health systems, recording traffic data, and issuing official 
documents. This chapter takes into account current socio-economic and techno- 
logical trends, including boosting productivity in an environment with significant 
budgetary constraints, the increasing demand for medical and social services, and 
standardization and interoperability as important requirements for public sector 
technologies and applications. Some examples of potential benefits are as follows: 


• Open government and data sharing: The free flow of information from
organizations to citizens promotes greater trust and transparency between
citizens and government, in line with open data initiatives.

• Citizen sentiment analysis: Information from both traditional and new social
media (websites, blogs, Twitter feeds, etc.) can help policy makers to
prioritize services and be aware of citizens’ interests and opinions.

• Citizen segmentation and personalization while preserving privacy: Tailoring
government services to individuals can increase effectiveness, efficiency, and
citizen satisfaction.

• Economic analysis: Correlation of multiple sources of data will help
government economists produce more accurate financial forecasts.

• Tax agencies: Automated algorithms that analyse large datasets, together with
the integration of structured and unstructured data from social media and other
sources, will help tax agencies validate information or flag potential fraud.

• Smart city and Internet of things (IoT) applications: The public sector is
increasingly characterized by applications that rely on sensor measurements of
physical phenomena such as traffic volumes, environmental pollution, usage
levels of waste containers, location of municipal vehicles, or detection of
abnormal behaviour. The integrated analysis of these high-volume and
high-velocity IoT data sources has the potential to significantly improve urban
management and positively impact the safety and quality of life of citizens.

• Cyber security: Collecting, organizing, and analysing vast amounts of data
from government computer networks with sensitive data or critical services
gives cyber defenders a greater ability to detect and counter malicious attacks.


11.1.1 Big Data for the Public Sector 


As of today, there are no broad implementations of big data in the public
sector. Compared to other sectors, the public sector has not traditionally used
data mining technologies intensively. However, there is a growing interest in
the public sector in the potential of big data for improvement in the current
financial environment.

Some examples of the growing global awareness are the Joint
Industry/Government Task Force to drive development of big data in Ireland,
announced by the Irish Minister for Jobs, Enterprise and Innovation in June
2013 (Government of Ireland 2013), and the announcement made by the Obama
administration (The White House 2012) of the “Big Data Research and Development
Initiative”, in which six Federal departments and agencies announced more than
$200 million in new commitments to greatly improve the tools and techniques
needed to access, organize, and glean discoveries from huge volumes of digital
data.


11.1.2 Market Impact of Big Data 


There is no direct market impact or competition, as the public sector is not a
productive sector, although its expenditure represented 49.3 % of the GDP of
the EU28 in 2012. The major part of the sector’s income is collected through
taxes and social contributions. Hence, the impact of big data technologies is
in terms of efficiency: the more efficient the public sector is, the better off
citizens are, as fewer resources (taxes) are needed to provide the same level
of service. Therefore, the more effective the public sector is, the more
positive the impact on the economy, by extension to the productive sectors, and
the more positive the impact on society. Additionally, the quality of services
provided, for example, education, health, social services, active policies, and
security, can also be improved by making use of big data technologies.


11.2 Analysis of Industrial Needs in the Public Sector


The benefits of big data in the public sector can be grouped into three major areas, 
based on a classification of the types of benefits: 


Big Data Analytics This area covers applications that can only be realized
with automated algorithms for advanced analytics, which analyse large datasets
to solve problems and reveal data-driven insights. Such abilities can be used
to detect and recognize patterns or to produce forecasts.

Applications in this area include fraud detection (McKinsey Global Institute 
2011); supervision of private sector regulated activities; sentiment analysis of 
Internet content for the prioritization of public services (Oracle 2012); threat 
detection from external and internal data sources for the prevention of crime, 
intelligence, and security (Oracle 2012); and prediction for planning purposes of 
public services (Yiu 2012). 


Improvements in Effectiveness This area covers the application of big data to
provide greater internal transparency. Citizens and businesses can make better
decisions and be more effective, and even create new products and services,
thanks to the information provided. Some examples of applications in this area
include data availability across organizational silos (McKinsey Global
Institute 2011); sharing information across public sector organizations
[e.g. avoiding problems arising from the lack of a single identity database, as
in the UK (Yiu 2012)]; and open government and open data, which facilitate the
free flow of information from public organizations to citizens and businesses
and the reuse of data to provide new and innovative services to citizens
(McKinsey Global Institute 2011; Ojo et al. 2015).


Improvements in Efficiency This area covers the applications that provide better 
services and continuous improvement based on the personalization of services and 
learnings from the performance of such services. Some examples of applications in 
this area are personalization of public services to adapt to citizen needs and 
improving public services through internal analytics based on the analysis of 
performance indicators. 


11.3 Potential Big Data Applications for the Public Sector 


Four potential applications for the public sector were described and developed in 
Zillner et al. (2013, 2014) to demonstrate the use of big data technologies in the 
public sector (Table 11.1). 


Table 11.1 Summary of application scenarios for the public sector 

Name: Monitoring and supervision of regulated activities for online gambling operators 
Summary: The large volumes of data available make it difficult to regulate and 
supervise activities effectively. 
Synopsis: To monitor online gambling operators for the control of regulated 
activities and the detection of fraud. The user of this application is the public body 
in charge of the supervisory activity. This procedure is a regulatory obligation 
imposed by the public administration; the online gambling operators must provide 
the information to the regulatory public body through a specific communication 
channel. Real-time data is received from gambling operators every 5 min. 
Business objectives: Ensure compliance with regulations, fraud prevention and 
detection, and criminal investigation. 

Name: Operative efficiency in labour agency 
Summary: Extract value from available large volumes of unused data. 
Synopsis: Enable a new range of personalized services, improve customer services, and 
cut operation costs in the German Federal Labour Agency. All unemployed workers 
were receiving the same standard services despite having different profiles. 
Historical data on the agency's customers was analysed, including profiles, 
interventions, and the time it took to find a job. Based on this analysis, customer 
segmentation was developed. 
Business objectives: Reduce the cost and improve the quality of the service: 
unemployed workers are now able to find a new job in a shorter period of time. 

Name: Public safety in smart cities 
Summary: Large volumes of data available from sensors, social media, and emergency 
calls can be combined to provide effective public safety. 
Synopsis: Smart cities equipped with sensors and communication infrastructures help 
the public sector keep cities and their citizens safe. Having accurate and up-to-date 
information allows better and faster responses during emergencies and results in 
less damage and fewer casualties. Typical sources of such information are emergency 
response calls, surveillance cameras, and mobile forces (such as a police patrol car) 
that have arrived at a site. In recent years social media have shown interesting 
potential for gathering information that aids in obtaining an accurate situational 
awareness picture (van Kasteren et al. 2014). All gathered information is collected 
in a command and control centre where an operator can decide how to steer available 
mobile forces. 
Business objectives: Quick response to emergencies, prevention of damage, and fewer 
casualties. 

Name: Predictive policing using open data 
Summary: Reuse of public open data to provide predictive policing. 
Synopsis: Governments around the world have started open data initiatives to make 
public sector data available to the public for the sake of transparency and to allow 
third parties to offer services based on the data. One such service can be described 
as predictive policing, where historical crime data is used to automatically discover 
trends and patterns. The identified patterns help in gaining insights into 
crime-related problems a city is facing and allow a more effective and efficient 
deployment of police forces (Wang et al. 2013; PredPol 2013). 
Business objectives: Significant decrease in crime, efficient use of mobile forces. 


11.4 Drivers and Constraints for Big Data in the Public Sector 


The key drivers and constraints of big data technologies in the public sector are: 


11.4.1 Drivers 


The following drivers were identified for big data in the public sector: 


Governments can act as catalysts in the development of a data ecosystem 
through the opening of their own datasets, and actively managing their dissem- 
ination and use (World Economic Forum 2012). 

Open data initiatives are a starting point for boosting a data market that can 
take advantage of open information (content) and big data technologies. 
Active policies in the area of open data can therefore benefit the private sector 
and, in return, facilitate the growth of this industry in Europe. In the end this will 
benefit public budgets through increased tax revenues from a growing European 
data industry. 


11.4.2 Constraints 


The constraints for big data in the public sector can be summarized as follows: 


Lack of political willingness to make the public sector take advantage of these 
technologies. It will require a change in mind-set of senior officials in the public 
sector. 

Lack of skilled business-oriented people aware of where and how big data can 
help to solve public sector challenges, and who may help to prepare the regula- 
tory framework for the successful development of big data solutions. 

The new General Data Protection Regulation and the PSI directives leave 
some uncertainty about their impact on the implementation of big data and 
open data initiatives in the public sector. This is significant because open data is 
set to be a catalyst through which the public sector helps the private sector to 
establish a powerful data industry. 

Gaining adoption momentum. Today, there is more marketing around big data 
in the public sector than real experience from which to learn which applications 
are more profitable and how they should be deployed. This requires the 
development of a standard set of big data solutions for the sector. 

Numerous bodies in public administration (especially in administrations that are 
widely decentralized). Much energy is lost, and will continue to be lost, until a 
common strategy is realized for the reuse of cross-technology platforms. 


11.5 Available Public Sector Data Resources 


In Directive 2003/98/EC (The European Parliament and the Council of The 
European Union 2003), on the re-use of public sector information, public sector 
information (PSI) is defined as follows: “It covers any representation of acts, facts 
or information — and any compilation of such acts, facts or information — whatever 
its medium (written on paper, or stored in electronic form or as a sound, visual or 
audio-visual recording), held by public bodies. A document held by a public sector 
body is a document where the public sector body has the right to authorise re-use.” 

According to Correia (2004), concerning the availability of the information 
produced by those public bodies, and in the absence of specific guidelines, the 
producing body is free to decide how to make it available: directly to the end users, 
establishing a public/private partnership, or outsourcing the commercial exploit- 
ation of that information to private operators. The Directive 2003/98/EC clarifies 
that activities falling outside the public task: “will typically include supply of 
documents that are produced and charged for exclusively on a commercial basis 
and in competition with others in the market”. 

On the nature of the PSI available, there are several approaches. The Green paper 
on PSI (European Commission 1998) proposes some classifications such as: 


• PSI distinction between administrative and non-administrative 
• PSI distinction regarding its relevance for the public 


Additionally it can be distinguished according to its potential market value, and 
in some cases according to the content of personal data: 


• PSI distinction according to its anonymity 


Most of the data produced by the public sector is textual or numerical, in contrast 
to sectors such as healthcare, which produce large amounts of electronic images. 
As a result of the e-government initiatives of the past 15 years, a large part of this 
data is created in digital form: 90 % according to McKinsey 
(McKinsey Global Institute 2011). 

According to the survey of public sector representatives conducted for the 
formulation of the European Big Data Value Partnership (Zillner et al. 2014), the key 
data asset is the whole system of public sector registries, databases, and information 
systems, of which the most significant are: 


• Citizens, business, and properties (e.g. base registries, transactions) 
• Fiscal data 
• Security data 
• Document management, especially as electronic transactions are growing 
• Public procurement and expenses 
• Public bodies and employees 
• Geographical data, mainly cadastral 
• Content related to culture, education, and tourism 
• Legislative documents 
• Statistical data (socio-economic data that could be used by the private sector) 
• Geospatial data 


11.6 Public Sector Requirements 


The requirements of the public sector were broken down into non-technical and 
technical requirements. 


11.6.1 Non-technical Requirements 


Privacy and Security Issues The aggregation of data across administrative 
boundaries in a non-request-based manner is a real challenge, since this information 
may reveal highly sensitive personal and security information when combined 
with various other data sources, compromising not only individual privacy but also 
civil security. Access rights to the datasets required for an operation must be 
justified and obtained. When a new operation is performed over existing data, a 
notification or a license must be obtained from the Data Privacy Agency. Anonymity 
must be preserved in these cases, so data dissociation is required. Individual 
privacy and public security concerns must be addressed before governments can be 
convinced to share data more openly, whether publicly or in a restricted 
manner with other governments or international entities. Another dimension is the 
regulation of the use of cloud computing in such a way that the public sector can 
trust cloud providers. Furthermore, the lack of European big data cloud computing 
providers within the European market is also a barrier to adoption. 


Big Data Skills There is a lack of skilled data scientists and technologists who can 
capture and process these new data sources. As big data technologies become 
increasingly adopted in business, skilled big data professionals will become harder 
to find. Public agencies could go a fair distance with the skills they already 
have, but they will then need to make sure those skills advance (1105 Government 
Information Group n.d.). Besides technically oriented people, there is a lack of 
business-oriented people who are aware of what big data can do to 
help them solve public sector challenges. 


Other Requirements Other non-technical requirements include: 


• Willingness to supply and to adopt big data technologies, and also to know how 
to use it. 

• Need for common national or European approaches (policies), like the 
European policies for interoperability and open data. Lack of leadership in this 
field. 

• A general mismatch between business intelligence in general and big data in 
particular in the public sector. 


11.6.2 Technical Requirements 


Below is a detailed description of each of the eight technical requirements that were 
distilled from the four big data applications selected for the Public Sector Forum. 


Pattern Discovery Identifying patterns and similarities to detect specific criminal 
or illegal behaviours in the application scenario of monitoring and supervision of 
online gambling operators (and also for similar monitoring scenarios within the 
public sector). This requirement is also applicable in the scenario to improve 
operative efficiency in the labour agency, and in the predictive policing scenario. 
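
As a minimal sketch of how such patterns might be surfaced, assume each gambling operator reports a wagered total per 5-minute window (the reporting interval mentioned in Table 11.1). The function name, the robust z-score rule, the threshold, and the sample figures are illustrative assumptions, not the method used in the cited scenarios.

```python
from statistics import median


def flag_unusual_windows(amounts, threshold=3.5):
    """Return indices of windows whose amount deviates strongly from the median.

    Uses the median absolute deviation (MAD), which is less sensitive to the
    outliers we are looking for than the mean and standard deviation.
    """
    centre = median(amounts)
    mad = median(abs(a - centre) for a in amounts)
    if mad == 0:
        # At least half of the windows are identical: the score is undefined.
        return []
    flagged = []
    for i, a in enumerate(amounts):
        # 0.6745 rescales the MAD so the score is comparable to a z-score.
        score = 0.6745 * (a - centre) / mad
        if abs(score) > threshold:
            flagged.append(i)
    return flagged


# Hypothetical wagered totals (in euros) for consecutive 5-minute windows
windows = [1020, 980, 1005, 995, 1010, 9800, 990, 1000]
print(flag_unusual_windows(windows))  # [5]
```

A flagged window is only a candidate for review by the supervisory body; deciding whether it reflects illegal behaviour requires further analysis.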


Data Sharing/Data Integration Required to overcome lack of standardization of 
data schemas and fragmentation of data ownership. Integration of multiple and 
diverse data sources into a big data platform. 


Real-Time Insights Enable analysis of fresh/real-time data for instant decision- 
making, for obtaining real-time insights from the data. 


Data Security and Privacy Legal procedures and technical means that allow the 
secure and privacy preserving sharing of data. The solutions to this requirement 
may unlock the widespread use of big data in public sector. Advances in the 
protection and privacy of data are key for the public sector, as it may allow the 
analysis of huge amounts of data owned by the public sector without disclosing 
sensitive information. These privacy and security issues are preventing the use of 
cloud infrastructures (processing, storage) by many public agencies that deal with 
sensitive data. 
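
One technical means of this kind is differential privacy, which the technology roadmap later in this chapter lists under automatic privacy protection. The sketch below shows the Laplace mechanism applied to a counting query. The function names, the example records, and the privacy budget `epsilon` are assumptions chosen for illustration.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential variables with mean
    # `scale` follows a Laplace distribution centred on zero.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Release a count with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical registry extract: (age, receives_benefit)
records = [(34, True), (51, False), (29, True), (62, True), (45, False)]
noisy = private_count(records, lambda r: r[1], epsilon=0.5,
                      rng=random.Random(0))
# The true count is 3; the released value differs from it by random noise.
print(round(noisy, 2))
```

A smaller `epsilon` gives stronger protection at the cost of noisier answers, which is the trade-off an agency would have to set for each released statistic.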


Real-Time Data Transmission Because the capability to deploy sensors is 
increasing in smart city application scenarios, there is a high demand for real-time 
data transmission. Distributed processing and cleaning capabilities will be required 
for image sensors, so that the communication channels are not overloaded and only 
the required information is passed to the real-time analysis that feeds 
situational awareness systems for decision-makers. 
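
The idea of cleaning data close to the sensor can be sketched as follows: a node forwards a reading only when it differs from the last transmitted value by more than a tolerance. The function name, the tolerance, and the occupancy counts (imagined as already derived from camera images on the node) are hypothetical.

```python
def changed_readings(readings, tolerance):
    """Yield (index, value) only when the value moved by more than `tolerance`
    since the last value that was transmitted."""
    last_sent = None
    for i, value in enumerate(readings):
        if last_sent is None or abs(value - last_sent) > tolerance:
            last_sent = value
            yield i, value


# Hypothetical occupancy counts produced by a camera node once per second
raw = [12, 12, 13, 12, 25, 26, 25, 40, 40, 41]
sent = list(changed_readings(raw, tolerance=5))
print(sent)  # [(0, 12), (4, 25), (7, 40)]
print(len(sent), "of", len(raw), "readings transmitted")
```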


Natural Language Analytics Extract information from unstructured online 
sources (e.g. social media) to enable sentiment mining. Recognition of data from 
natural language inputs like text, audio, and video. 


Predictive Analytics Provide predictions, based on learning from previous 
situations, to forecast optimal resource allocation for public services. An example 
is the application scenario for predictive policing, where the goal is to distribute 
security forces and resources according to the prediction of incidents. 


Modelling and Simulation Domain-specific tools for modelling and simulating 
events based on data from past events, in order to anticipate the results of decisions 
taken to influence current conditions in real time, for example, in public safety 
scenarios. 
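
As a rough illustration of the predictive analytics requirement above, the sketch below forecasts incidents per district with a simple moving average and distributes patrol units in proportion to the forecast. The district names, incident counts, and the moving-average model are hypothetical; a real deployment would likely use a richer model.

```python
def forecast_next(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def allocate_units(forecasts, total_units):
    """Split `total_units` across districts in proportion to forecast demand,
    using the largest-remainder method so the shares sum exactly."""
    total = sum(forecasts.values())
    if total == 0:
        raise ValueError("no forecast demand to allocate against")
    exact = {d: total_units * f / total for d, f in forecasts.items()}
    units = {d: int(x) for d, x in exact.items()}
    leftover = total_units - sum(units.values())
    # Hand remaining units to the districts with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda d: exact[d] - units[d], reverse=True)
    for d in by_remainder[:leftover]:
        units[d] += 1
    return units


# Hypothetical weekly incident counts per district
history = {
    "north": [4, 6, 5, 7, 9],
    "centre": [12, 11, 14, 13, 15],
    "south": [2, 3, 2, 1, 3],
}
forecasts = {d: forecast_next(h) for d, h in history.items()}
print(allocate_units(forecasts, total_units=10))
# {'north': 3, 'centre': 6, 'south': 1}
```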


11.7 Technology Roadmap for Big Data in the Public 
Sector 


For each requirement in the sector, this section presents applicable technologies and 
the research questions to be developed (Fig. 11.1). All references presented here are 
from Curry et al. (2014). 


11.7.1 Pattern Discovery 


• Data Analysis Technology: Semantic pattern technologies including stream 
pattern matching. 


– Research Question: Scalable complex pattern matching. Scaling to datasets with 
trillions of entries is expected to take 5 years. 


• Data Curation Technology: Validation of pattern analytics outputs with humans 
via curation. 


– Research Question: Curation at scale depends on the interplay between 
automated curation platforms and collaborative approaches leveraging large 
pools of data curators. Commercial application results could be reached in 
6-10 years. 


• Data Storage Technology: Analytical databases, Hadoop, Spark, Mahout. 


– Research Question: Standard Array Query Language. There is currently a lack 
of standardized query languages, although efforts such as ArrayQL are under way. 
Adoption is not yet widespread, and existing DBs (SciDB, Rasdaman) are used in 
the scientific community. This may change in 3-5 years. 


Fig. 11.1 Mapping requirements to research questions in the public sector 
(columns: Technology, Research Question) 

11.7.2 Data Sharing/Data Integration 


• Data Acquisition Technology: To facilitate integration as well as analysis. 


– Research Question: Data fragment selection, sampling, and scalability. Solutions 
are expected to come from quantum computers (predicted to be available 
in 5-10 years, although 15-20 years seems more realistic). 


• Data Analysis Technology: Linked data provides the best technology set for 
sharing data on the Web. Linked data and ontologies provide mechanisms for 
integrating data (map to the same ontology; map between 
ontologies/schemas/instances). 


– Research Question: Scalability, dealing with high speed of data and high 
variety. Dealing with trillions of nodes will take 3-5 years. 

– Research Question: Making semantic systems easy to use by non-semantic 
(logic) experts. It will take at least 5 years to have comprehensive tooling 
support. 


• Data Curation/Storage Technology: Metadata and data provenance frameworks. 


– Research Question: What are standards for common data tracing formats? 
Provenance on certain storage types, e.g. graph databases, is still 
computationally expensive. The integration of provenance-awareness into existing 
tools can be achieved in the short term (2-3 years) once market demand reaches a 
critical level. 
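
The ideas of mapping to a shared ontology and of tracking provenance can be sketched together. Below, records from two hypothetical agencies with different field names are mapped to one shared vocabulary, and each output record carries its source. All field names, agency names, and the `provenance` layout are assumptions for illustration, not a standard format.

```python
# Per-source mapping from local field names to a shared vocabulary
FIELD_MAPS = {
    "tax_office": {"nif": "person_id", "apellido": "family_name"},
    "land_registry": {"owner_id": "person_id", "surname": "family_name"},
}


def to_shared_schema(record, source):
    """Rename known fields to the shared vocabulary and attach provenance."""
    mapping = FIELD_MAPS[source]
    mapped = {mapping[k]: v for k, v in record.items() if k in mapping}
    unmapped = sorted(k for k in record if k not in mapping)
    mapped["provenance"] = {"source": source, "unmapped_fields": unmapped}
    return mapped


def integrate(batches):
    """Group mapped records from all sources by the shared person identifier."""
    merged = {}
    for source, records in batches.items():
        for record in records:
            row = to_shared_schema(record, source)
            merged.setdefault(row["person_id"], []).append(row)
    return merged


batches = {
    "tax_office": [{"nif": "X1", "apellido": "Lopez", "balance": 120}],
    "land_registry": [{"owner_id": "X1", "surname": "Lopez"}],
}
result = integrate(batches)
print(len(result["X1"]))  # 2
```

Keeping the source and the list of unmapped fields with every record makes it possible to trace an integrated value back to the registry it came from.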


11.7.3 Real-Time Insights 


• Data Analysis Technology: Linked data and machine learning technologies can 
support automated analysis, which is required for gaining real-time insights. 


– Research Question: High performance while coping with the 3 Vs (volume, 
variety and velocity). Real-time deep analytics is more than 5 years away. 


• Data Storage Technology: Google Data Flow, Amazon Kinesis, Spark, Drill, 
Impala, in-memory databases. 


– Research Question: How can ad hoc and streaming queries on large datasets 
be executed with minimal latencies? This is an active research field and may 
reach further maturity in a few years’ time. 


11.7.4 Data Security and Privacy 


• Data Storage Technology: Encrypted storage and DBs; proxy re-encryption 
between domains; automatic privacy protection (e.g. differential privacy). 


– Research Question: Advances in “privacy by design” to link analytics needs 
with protective controls in processing and storage. A legal framework, e.g. 
the General Data Protection Regulation (GDPR), has to be harmonized 
among EU member states. Beyond legislation, data and social commons are 
required (Curry et al. 2014). This will require at least a further 3 years of 
research. 


11.7.5 Real-Time Data Transmission 


• Data Acquisition Technology: Kafka, Flume, Storm, etc. (Curry et al. 2014). 


– Research Question: Distributed processing and cleaning. Approaches should be 
able to tell users what type of resources they require to perform the tasks they 
specify (e.g. process 10 GB/s). First approaches towards this end are emerging 
and should be available on the market within the next 5 years. 


• Data Storage Technology: Current best practice: write-optimized storage 
solutions (e.g. HDFS), columnar stores. 


– Research Question: How to improve the random read/write performance of 
database technologies. The Lambda Architecture described by Marz and 
Warren reflects the current best practice for persisting high-velocity 
data. Effectively, it addresses the insufficient random read/write 
performance of existing DB technologies. Performance increases will 
be continuous and incremental and will simplify the overall development of big 
data technology stacks. Technologies could reach a level of maturity that leads to 
simplified architectural blueprints in 3-4 years. 
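
A minimal sketch of the Lambda Architecture idea mentioned above, assuming a batch view that is recomputed periodically from an append-only master dataset and a speed layer that counts only the events received since the last batch run. The class and method names are hypothetical.

```python
from collections import Counter


class LambdaCounter:
    """Counts events per key by merging a batch view with a real-time view."""

    def __init__(self):
        self.master = []             # append-only log of all events (batch layer)
        self.batch_view = Counter()  # computed from `master` at the last batch run
        self.speed_view = Counter()  # events seen since the last batch run

    def ingest(self, key):
        # Every event is appended sequentially and also counted incrementally,
        # so stored data never has to be updated in place.
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # Recompute the batch view from scratch, then reset the speed layer.
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, key):
        # Serving layer: merge both views at read time.
        return self.batch_view[key] + self.speed_view[key]


lc = LambdaCounter()
for sensor in ["a", "b", "a"]:
    lc.ingest(sensor)
lc.run_batch()
lc.ingest("a")        # arrives after the batch run, held in the speed layer
print(lc.query("a"))  # 3
```

Because writes are append-only, the design sidesteps the weak random write performance the research question refers to, at the price of maintaining two processing paths.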


11.7.6 Natural Language Analytics 


• Data Analysis Technology: Information extraction, named entity recognition, 
machine learning, linked data. Entity linking and co-reference resolution. 


– Research Question: Increasing scalability and robustness. Robust scalable 
solutions are at least 3-5 years away. 


• Data Curation Technology: Validation of Natural Language Analytics (NLA) 
outputs with humans via curation. 


– Research Question: Curation at scale depends on the interplay between 
automated curation platforms and collaborative approaches leveraging large 
pools of data curators. Technically, this integration can be achieved in the 
short term (2-3 years). 


11.7.7 Predictive Analytics 


• Data Storage Technology: Analytical databases. 


– Research Question: How can databases efficiently support predictive 
analytics? From a storage point of view, analytical databases address the problem 
of better performance, as the DB itself is able to execute analytical code. 
There is currently a lack of standardized query languages, although efforts such as 
ArrayQL are under way. This may change in 3-5 years. 


11.7.8 Modelling and Simulation 


• Data Storage Technology: Best practices; batch and in-stream processing 
(Lambda architecture), temporal databases. 


– Research Question: How can time-series data be managed in a general way 
for effective analysis? Spatiotemporal databases are an active research field 
and results may be beyond a 5-year time scale. 


• Data Usage Technology: Standards in (semantic) modelling; application of 
simulation in planning (e.g. plant planning). 


– Research Question: Making models explicit and/or transparent. This is a 
research question with a long timeline (beyond 2020). 


11.8 Conclusion and Recommendations for the 
Public Sector 


The findings after analysing the requirements and the technologies currently avail- 
able show that there are a number of open research questions to be addressed in 
order to develop the technologies such that competitive and effective solutions can 
be built. The main developments are required in the fields of scalability of data 
analysis, pattern discovery, and real-time applications. Also required are improve- 
ments in provenance for the sharing and integration of data from the public sector. 

It is also extremely important to provide integrated security and privacy 
mechanisms in big data applications, as the public sector collects vast amounts of 
sensitive data. In many countries legislation limits the use of data to the purposes 
for which it was originally obtained. In any case, respecting the privacy of citizens 
is a mandatory obligation in the European Union. 

Another area of particular interest for safety applications in the public sector is 
the analysis of natural language, which can be useful as a method to gather 
unstructured feedback from citizens, e.g. from social media and networks. The 
development of effective predictive analytics, as well as of modelling and simulation 
tools for the analysis of historical data, is a key challenge to be addressed by 
future research. 


Open Access This chapter is distributed under the terms of the Creative Commons Attribution- 


Chapter 12 
Big Data in the Finance and Insurance Sectors 


Kazim Hussain and Elsa Prieto 


12.1 Introduction 


The finance and insurance sector by nature has been an intensively data-driven 
industry for many years, with financial institutes having managed large quantities of 
customer data and using data analytics in areas such as capital market trading. The 
business of insurance is based on the analysis of data to understand and effectively 
evaluate risk. Actuaries and underwriting professionals depend upon the analysis of 
data to be able to perform their core roles; thus it is safe to state that this data is a 
dominant force in the sector. 

There is, however, an increasing prevalence of data that falls into the domain of 
big data, i.e. high-volume, high-velocity, and high-variety information assets born 
out of the advent of new customer, market, and regulatory data surging from 
multiple sources. Adding to the complexity is the co-existence of structured and 
unstructured data. Unstructured data in the financial services and insurance 
industry can be identified as an area with a vast amount of unexploited business 
value. For example, there is much commercial value to be derived from the large 
volumes of insurance claim documentation, which is predominantly in text 
form and contains descriptions entered by call centre operators and notes associated 
with individual claims and cases. With the help of big data technologies, not only 
can value be extracted more efficiently from such a data source, but this form of 
unstructured data can also be analysed in conjunction with a wide variety of other 
datasets to extract faster, more targeted commercial value. An important 
characteristic of big data in this industry is value: how can a business not only 
collect and manage big data, but also identify the data that holds value and 
forward-engineer (as opposed to retrospectively evaluate) commercial value from 
it? 


12.1.1 Market Impact of Big Data 


The market for big data technology in the financial and insurance domains is one of 
the most promising. According to TechNavio’s forecast (Technavio 2013), the 
global big data market in the financial services sector will grow at a CAGR of 
56.7 % over the period 2012-2016. One of the key factors contributing to this 
market growth is the need to meet financial regulations, but the lack of skilled 
resources to manage big data could pose a challenge. 
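
To make the forecast concrete, the short calculation below converts the quoted compound annual growth rate into an overall growth multiple. It assumes four yearly compounding steps between 2012 and 2016, which is an interpretation of the stated period rather than a figure from the cited forecast; the function name is illustrative.

```python
def growth_factor(cagr, periods):
    """Total growth multiple implied by a compound annual growth rate."""
    return (1.0 + cagr) ** periods


# 2012 -> 2016 treated as four yearly steps (assumption)
factor = growth_factor(0.567, periods=4)
print(round(factor, 2))  # 6.03
```

Under that assumption, a market growing at 56.7 % per year would be roughly six times its 2012 size by 2016.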

The key vendors dominating this space include Hewlett-Packard, IBM, 
Microsoft, and Oracle, which are well-established global players with a generalist 
profile. However, the appeal of the market will be a pull factor for new entrants in 
the coming years. 

With data being the most important asset, this technology is especially 
favourable and differentiating for financial services organizations, as stated in the 
IBM Institute for Business Value’s report “Analytics: The real-world use of big 
data in financial services” (IBM 2013). By leveraging this asset, banks and financial 
markets firms can gain a comprehensive understanding of markets, customers, 
channels, products, regulations, competitors, suppliers, and employees that will 
let them compete better. This is therefore a positive trend in the market and is 
expected to drive the growth of the global big data market in the financial services 
sector. 

In terms of data strategy, financial services organizations are taking a business- 
driven approach to big data: business requirements are identified in the first place 
and then existing internal resources and capacities are aligned to support the 
business opportunity, before investing in the sources of data and infrastructures. 
However, not all financial organizations are keeping the same pace. According to 
the IBM report, while 26 % are focused on understanding the principal notions 
(compared with 24 % of global organizations), the majority are either defining a 
roadmap related to big data (47 %) or already conducting big data pilots and 
implementations (27 %). 

Where financial services organizations lag behind their cross-industry peers is in 
using more varied data types within their big data implementations. Slightly more 
than 21 % of these firms are analysing audio data (often produced in abundance in 
retail banks’ call centres), while slightly more than 27 % report analysing social 
data (compared to 38 % and 43 %, respectively, of their cross-industry peers). This 
lack of focus on unstructured data is attributed to the ongoing struggle to integrate 
the organizations’ massive structured data. 


12.2 Analysis of Industrial Needs in the Finance 
and Insurance Sectors 


The advent of big data in financial services can bring numerous advantages to 
financial institutions. Benefits that come with the greatest commercial impact are 
highlighted as follows: 


Enhanced Levels of Customer Insight, Engagement, and Experience With the 
digitization of financial products and services and the increasing trend of customers 
interacting with brands or organizations in the digital space, there is an opportunity 
for financial services organizations to enhance their level of customer engagement 
and proactively improve the customer experience. Many argue that this is the most 
crucial area for financial institutes to start leveraging big data technology to stay 
ahead, or even just keep up with competition. To help achieve this, big data 
technologies and analytical techniques can help derive insight from newer unstruc- 
tured sources such as social media. 


Enhanced Fraud Detection and Prevention Capabilities Financial services 
institutions have always been vulnerable to fraud. There are individuals and crim- 
inal organizations working to defraud financial institutions and the sophistication 
and complexity of these schemes is evolving with time. In the past, banks analysed 
just a small sample of transactions in an attempt to detect fraud. This could lead to 
some fraudulent activities slipping through the net and other “false positives” being 
highlighted. Utilization of big data has meant these organizations are now able to 
use larger datasets to identify trends that indicate fraud to help minimize exposure 
to such a risk. 


Enhanced Market Trading Analysis Trading in the financial markets started 
becoming a digitized space many years ago, driven by the growing demand for 
faster execution of trades. Trading strategies that use sophisticated 
algorithms to trade financial markets rapidly are major beneficiaries of big data. 


Market data can itself be considered big data. It is high in volume, it is 
generated from a variety of sources, and it is generated at a phenomenal velocity. 
However, this big data does not necessarily translate into actionable information. 
The real benefit from big data lies in effectively extracting actionable information 
and integrating it with other sources. Market data from multiple 
markets and geographies, as well as from a variety of asset classes, can be integrated 
with other structured and unstructured sources to create enriched, hybrid datasets 
(a combination of structured and unstructured data). This provides a comprehensive 
and integrated view of the market state and can be used for a variety of activities 
such as signal generation, trade execution, profit and loss (P&L) reporting, and risk 
measurement, all in real time, hence enabling more effective trading. 


12.3 Potential Big Data Applications in Finance 
and Insurance 


Three potential applications for the finance and insurance sector were described and 
developed in Zillner et al. (2013, 2014) as representatives of the application of big 
data technologies in the sector (Table 12.1). 


Table 12.1 Summary of big data application scenarios for the finance and insurance sector

Name                 Market manipulation detection
Summary              Detection of false rumours that try to manipulate the market
Synopsis             Financial markets are often influenced by rumours. Sometimes false rumours
                     are intentionally placed in order to distract and mislead other market
                     participants. These behaviours differ based on the intended outcome of the
                     manipulation. Examples of market abuse are market sounding (the illegal
                     dissemination of untrue information about a company whose stock is traded
                     on exchanges) and pump and dump (falsely positive reports are published about
                     a company whose shares are tradable, with the goal of encouraging other
                     market participants to buy stock in the corresponding company; an increase in
                     demand would cause the price of the stock to rise to an artificial level)
Business objectives  Identifying hoaxes and assessing the consistency of new information with
                     other reliable sources

Name                 Reputational risk management
Summary              Assessment of exposure to reputational risk connected to consulting services
                     offered by banks to their customers
Synopsis             A negative perception can adversely affect a bank’s ability to maintain
                     existing business relationships, establish new ones, or retain access to sources
                     of funding. The increase in the probability of default (issuer credit risk), the
                     price volatility, and the difficulty of exchanging specific financial products on
                     restricted markets have all contributed to the increase of the reputational and
                     operational risk associated with brokerage and advisory services. Banks and
                     financial institutions usually offer third party financial products. This implies
                     that a lack of performance of a third party product could have real impacts on
                     the relationship between the bank and its customers
Business objectives  To monitor third parties’ reputation and the effects of reputation disruption on
                     the direct relationship between banks and customers

Name                 Retail brokerage
Summary              Discover topic trends, detect events, or support portfolio optimization/
                     asset allocation
Synopsis             A general trend in the whole industry of retail brokerage and market data is to
                     come up with functionalities that offer actionable information. The focus is no
                     longer on figures based on quantitative historical data, e.g. key figures or
                     performance data. Instead, investors look for signals that have some kind of
                     predictive element yet are easy to understand. In that sense, the extraction of
                     sentiments and topics from textual sources is a perfect add-on for the
                     conventional data and functionalities that are already offered by retail brokerage
                     companies
Business objectives  Collecting and reviewing various sources of financial information
                     (on markets, companies, or financial institutions) repeatedly by automating
                     this task
12.4 Drivers and Constraints for Big Data in the Finance 
and Insurance Sectors 


The successful realization of big data in finance and insurance has several drivers 
and constraints. 


12.4.1 Drivers 


The following drivers were identified for big data in the finance and insurance 
sector: 


Data Growth: Financial transaction volumes are increasing, leading to data 
growth in financial services firms. In capital markets, the presence of electronic 
trading has led to an increase in the number of trades. Data growth is not limited 
to capital markets businesses. The Capgemini/RBS Global Payments study for 
2012 (Capgemini 2012) estimates that the global number of electronic payment 
transactions is about 260 billion and is growing by between 15 and 22 % in
developing countries. 

Increasing scrutiny from regulators: Regulators of the industry now require a 
more transparent and accurate view of financial and insurance businesses. This
means that they no longer want reports; they need raw data. Therefore, financial 
institutions need to ensure that they are able to analyse their raw data at the same 
level of granularity as the regulators. 

Advancements in technology mean increased activity: Thanks largely to the 
digitization of financial products and services, the ease and affordability of 
executing financial transactions online has led to ever-increasing activity and 
expansion into new markets. Individuals can make more trades, more often, 
across more types of accounts, because they can do so with the click of a button 
in the comfort of their own homes. 

Changing business models: Driven by the aforementioned factors, financial 
institutions find themselves in a market that is fundamentally different from the 
market of even a few years ago. Adoption of big data analytics is necessary to 
help build business models for financial institutions geared towards retention of 
market share from the increasing competition coming from other sectors. 

Customer insight: Today the relationship between banks and consumers has 
been reversed: consumers now have transient relationships with multiple banks. 
Banks no longer have a complete view of their customers’ preferences, buying 
patterns, and behaviours. Big data technologies therefore play a focal role in 
enabling customer centricity in this new paradigm. 
12.4.2 Constraints 


The constraints for big data in the finance and insurance sector can be summarized 
as follows: 


Old culture and infrastructures: Many banks still depend on old, rigid IT 
infrastructure, with data silos and many legacy systems. Big data, 
therefore, is an add-on, rather than a completely new standalone initiative. The 
culture is an even bigger barrier to big data deployment. Many financial organ- 
izations fail to implement big data programs because they are unable to appre- 
ciate how data analytics can improve their core business. 

A lack of skills: Some organizations have recognized the data and the
opportunities it presents; however, they lack human capital with the right level of
skills to bridge the gap between data and potential opportunity. The 
skills that are “missing” are those of a data scientist. 

Data “Actionability”: The next main challenge can be seen in making big data 
actionable. Big data technology and analytical techniques enable financial ser- 
vices institutions to get deep insight into customer behaviour and patterns, but 
the challenge still lies in organizations being able to take specific action based on 
this data. 

Data privacy and security: Customer data is a continuing cause for concern. 
Regulation remains a big unknown: what is and is not legally permissible in the 
ownership and use of customer data remains ill-defined, and that is an inhibiting 
factor to rapid and large-scale adoption. 


12.5 Available Finance and Insurance Data Resources 


The financial service system has several major pools of data that are held by 
different stakeholders/parties. Data are classified into three major categories: 


Structured Data This refers to information with a high degree of organization, 
such that it can be included in a relational database seamlessly and searched
readily by simple search algorithms or other search operations. 
Examples of financial structured data sources are: 


— Trading systems (transaction data) 

— Account systems (data on account holdings and movements) 

— Market data from external providers 

— Securities reference data 

— Price information 

— Technical indicators 
Unstructured Data Although the financial industry has previously focused on 
high velocity market data, it is now moving towards unstructured data in response
to changing trading dynamics. Examples of financial unstructured data are: 


— Daily stock feeds 

— Company announcements (ad-hoc news) 

— Online news media 

— Articles/blogs 

— Customers’ feedback/experiences 


Semi-structured Data A form of structured data that does not conform to the 
formal structure of data models associated with relational databases or other forms 
of data tables, but even so contains tags or markers to separate semantic elements 
and enforce hierarchies of records and fields within the data. Examples of semi- 
structured data are expressed in meta-languages (mostly XML-based) such as: 


— Financial products Markup Language (FpML) 

— Financial Information eXchange (FIX) 

— Interactive Financial eXchange (IFX) 

— Market Data Definition Language (MDDL) 

— Financial Electronic Data Interchange (FEDI) 

— Open Financial eXchange (OFX) 

— eXtensible Business Reporting Language (XBRL) 
— SWIFTStandards 


Unstructured information now accounts for around 80-85 % of the information
held in enterprises. Compared with other industries, however, the financial and
insurance industry has vast repositories of structured data, with a large amount
of this information originating inside the organization. 


12.6 Finance and Insurance Sector Requirements 


12.6.1 Non-technical Requirements 


Data Protection and Privacy Particularly in the EU, there are numerous data 
protection and privacy issues to consider when undertaking big data analytics. 
Regulatory requirements dictate that personal data must be processed for specified 
and lawful purposes and that the processing must be adequate, relevant, and not 
excessive. The impact of these principles for financial services organizations is 
significant, with individuals being able to ask financial services organizations to 
remove or refrain from processing their personal data in certain circumstances. 
This requirement could lead to increased costs for financial services organ- 
izations, as they deal with individuals’ requests. This removal of data may also 
lead to the dataset being skewed, as certain groups of people will be more active and 
aware of their rights than others. 


Confidentiality and Regulatory Requirements Any information provided by a 
third party that is subject to big data analytics is likely to be confidential
information. Therefore, financial services organizations will need to ensure that they 
comply with their obligations and that any use of such data does not give rise to a 
breach of their confidentiality or regulatory obligations. 


Liability Issues Just because big data contains an enormous amount of inform- 
ation, it does not mean that it reflects a representative sample of the population. 
Therefore there is a risk of misinterpreting the information produced and liability 
may arise where reliance is placed on that information. Financial 
services organizations have to take this into account when using big data in 
analytical models, and should ensure that any reliance placed upon the output comes with 
relevant disclaimers attached. 


12.6.2 Technical Requirements 


Data Extraction and Sentiment Classification Though the definition of sentiment
is vague, in general, a sentiment on an object is a positive or negative view, 
attitude, emotion, or appraisal of that object expressed by a document author or actor. 

Sentiment is often expressed in a domain-specific way, and using non-domain- 
specific vocabulary may lead to misclassifications. The goal is to extract facts and 
sentiments concerning the financial use cases: financial instruments, situations, 
conditions, indicators, and experts’ assessments regarding these instruments, as 
well as investors’ sentiment, etc. The classification of sentiment can be done at 
several levels: words, phrases, sentences, paragraphs, documents, and even multiple
documents; the results can then be aggregated. 
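A minimal sketch of this multi-level classification follows: words are scored against a domain-specific lexicon, word scores are summed per sentence, and sentence scores are aggregated into a document label. The lexicon entries and their weights are hypothetical and chosen only for illustration. Lexicon lookup is also a deliberately simple stand-in for the statistical and machine learning methods discussed later in this chapter.

```python
import re

# Hypothetical finance-specific lexicon. A general-purpose vocabulary would
# likely miss terms such as "bullish" or misread "default".
FINANCE_LEXICON = {
    "bullish": 1, "upgrade": 1, "outperform": 1,
    "bearish": -1, "downgrade": -1, "default": -1,
}


def sentence_score(sentence, lexicon=FINANCE_LEXICON):
    """Word level: sum the polarity of every known word in the sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return sum(lexicon.get(word, 0) for word in words)


def document_sentiment(text):
    """Sentence level, then document level: score each sentence and
    aggregate the sentence scores into a single label."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    scores = [sentence_score(s) for s in sentences]
    total = sum(scores)
    label = "positive" if total > 0 else "negative" if total < 0 else "neutral"
    return scores, label


scores, label = document_sentiment(
    "Analysts are bullish after the upgrade. The issuer may default."
)
```

The same aggregation step could be repeated one level higher, for example by summing document scores per instrument. Irony, negation, and shifts in semantic orientation are not handled by a lookup of this kind.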

Data extraction needs to cope with noise, misinformation, irony, bias, or
uncertainty. In addition, it is important to determine not only the sentiment
of a piece of information, but also how words affect the semantic orientation and 
how sentiment changes. 


Data Quality The more timely, accurate, and relevant the data (along with good 
analytics), the better the assessment of the current financial state is. This requires 
better processes for identifying and maintaining the data sources of interest, and for
verifying, cleaning, transforming, integrating, and deduplicating data. Due to the large 
amount of available data, there is a need for automated and scalable processes. 
Language detection methods also need to be refined to improve precision and 
reliability. 
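As a rough illustration of the cleaning and deduplication steps named above, the following sketch normalizes raw records and then drops repeats. The record layout (`isin`, `source`, `price`), the identifier values, and the choice of deduplication key are all assumptions made for the example, not a prescribed data model.

```python
def normalize(record):
    """Clean one raw record: trim whitespace, unify case, coerce the price."""
    return {
        "isin": record["isin"].strip().upper(),
        "source": record["source"].strip().lower(),
        "price": round(float(record["price"]), 2),
    }


def deduplicate(records):
    """Keep the first cleaned record seen for each (isin, price) pair."""
    seen = set()
    unique = []
    for raw in records:
        rec = normalize(raw)
        key = (rec["isin"], rec["price"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Made-up identifiers; the first two rows differ only in formatting.
raw = [
    {"isin": " de0001234567 ", "source": "FeedA", "price": "10.50"},
    {"isin": "DE0001234567", "source": "feedb", "price": 10.5},
    {"isin": "FR0007654321", "source": "FeedA", "price": "3.2"},
]
clean = deduplicate(raw)
```

Normalizing before comparing is the design choice that matters here: without it, the first two rows would not be recognized as duplicates. At the data volumes described in this chapter, the in-memory `seen` set would need to be replaced by a scalable equivalent.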


Data Acquisition For banks and financial services providers, the volume of data 
they generate, consume, store, and access will increase exponentially year over 
year. The applications depend on acquiring and accessing massive amounts of 
historical heterogeneous information and live feeds of unstructured, semi-structured,
and structured information. A significant amount of data comes from internal 
structured data, though there is a growing trend towards external unstructured data 
(from news, blogs, articles, social networks, and websites). Although a wide variety
of data sources may be accessible, the ones actually required depend on 
the design of the specific application. 


Data Integration/Sharing This describes the task of overcoming the heterogeneity 
of disparate data sources in terms of hardware, software, syntax, and/or semantics 
by providing access tools that enable interoperability. 

The data is usually scattered among different heterogeneous sources with
differing conceptual representations (different structures and data semantics), but it is 
presented to the end user as a single, homogeneous data source. 
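One common way to achieve this encapsulation is a wrapper/mediator arrangement, sketched below under simplifying assumptions. Each wrapper translates one source format into a shared record shape, and the mediator answers queries over all registered sources. The two source layouts (`ticker`/`px` and `sym`/`last`) and the `Mediator` class are hypothetical.

```python
import csv
import io
import json


def wrap_csv(text):
    """Wrapper for a CSV source with the columns: ticker, px."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"symbol": row["ticker"], "price": float(row["px"])}


def wrap_json(text):
    """Wrapper for a JSON source holding a list of {"sym", "last"} objects."""
    for obj in json.loads(text):
        yield {"symbol": obj["sym"], "price": float(obj["last"])}


class Mediator:
    """Presents several wrapped sources as one homogeneous collection."""

    def __init__(self):
        self._sources = []

    def register(self, wrapper, payload):
        self._sources.append((wrapper, payload))

    def query(self, symbol):
        """Return matching records from every source in the shared schema."""
        return [
            record
            for wrapper, payload in self._sources
            for record in wrapper(payload)
            if record["symbol"] == symbol
        ]


mediator = Mediator()
mediator.register(wrap_csv, "ticker,px\nABC,100.5\nXYZ,20\n")
mediator.register(wrap_json, '[{"sym": "ABC", "last": 100.7}]')
result = mediator.query("ABC")
```

The end user only sees `symbol` and `price`, regardless of how each source names or encodes them. The schema mapping is written by hand in this sketch, whereas the roadmap in Sect. 12.7 calls for it to be automated.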

The motivation for integration may be based on strategic or operational consi- 
derations. Regarding strategic considerations and analysis, it may not be required to 
constantly integrate the data but to integrate data snapshots at a certain point in 
time. For operational analysis a real-time integration of the most up-to-date inform- 
ation may be required. 

Typically data integration is not a once-off conversion but an on-going task, 
and it therefore poses the additional constraint that the chosen solution needs to be robust 
in terms of adaptability, extensibility, and scalability. Approaches leveraging 
standards such as eXtensible Business Reporting Language (XBRL) and Linked 
Data show promise (O’Riain et al. 2012). 

The rapid generation of continuous streams of information has challenged the 
storage, computation, and communication capabilities of computing systems, as 
such streams impose high resource requirements on data stream processing systems. 


Decision Support Systems (DSS) Model-driven DSS emphasises access to and 
manipulation of statistical, financial, optimization, and/or simulation models. 
Models use data and parameters to aid decision-makers in analysing a situation, 
for instance, assessing and evaluating decision alternatives and examining the 
effect of changes. This requires integrating information from the knowledge base 
into financial event detection models, visualization models, and decision-models, as
well as the scalable execution of these models. 

For some application scenarios, the response of the system should support real- 
time or near-real-time insights. The velocity of the response is subject to the end 
user requirements. 

In DSS, visualization is an extremely useful tool for providing overviews and 
insights into overwhelming amounts of data to support the decision-making 
process. 


Data Privacy and Security Top priorities for the financial sector today include 
on-going regulatory compliance [e.g. Sarbanes-Oxley (SOX) Act, U.S. Government 
(2002); EU data protection directive, Parliament (1995); cyber security directive, 
Parliament (2013)] and risk mitigation, continued adaptation to the expectations of 
consumers for anywhere/anytime service, reducing operational costs, and increasing 
efficiencies through use of cloud-based services. 

Banking and financial institutions need to secure the storage, transit, and use of 
corporate and personal data across business applications, including online banking 
and electronic communications of sensitive information and documents. 

The increasingly global nature and high interconnectivity of the industry make 
it necessary to comprehensively address international data security and privacy 
regulations, from the front to the back-end, and along the full supply chain, 
including third parties. Data is not always stored in-house but with third parties. 
Using commercial “cloud” services as data storage locations poses potential pri- 
vacy and security problems since the terms of service for these products are often 
poorly understood. 


12.7 Technology Roadmap for Big Data in the Finance 
and Insurance Sectors 


For each requirement in the sector, this section presents applicable technologies and 
the research questions to be developed (Fig. 12.1; Table 12.2). All references 
presented here are from Curry et al. (2014). 


12.7.1 Data Acquisition 


• Acquisition pipeline technology. 

— Research Question: Data stream management. Current data analysis in the 
stored-data domain shall need to move to management of data in the data 
stream itself. 

• Proprietary APIs technology. 

— Research Question: Privacy and anonymization at collection time. The data 
collection process shall require intrinsic data anonymization and/or 
decoupling of personal data from data emanating from business processes 
or otherwise. 

— Research Question: Social APIs. Moving ahead of existing proprietary 
(or even open) APIs, social APIs into financial services datasets need to be 
investigated. 
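The decoupling of personal data from business-process data at collection time can be illustrated with keyed pseudonymization. This is a sketch under stated assumptions: the event fields, the `collect` function, and the hard-coded key are hypothetical, and in practice the key would be held in a managed key store. Pseudonymization of this kind is also weaker than full anonymization, since whoever holds the key can re-link the two records.

```python
import hashlib
import hmac

# Hypothetical key for illustration; it would not be hard-coded in practice.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(customer_id: str) -> str:
    """Derive a stable, keyed pseudonym from a customer identifier."""
    return hmac.new(SECRET_KEY, customer_id.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def collect(event: dict):
    """Split an incoming event into a personal record and a business record
    that share only the pseudonymous token."""
    token = pseudonymize(event["customer_id"])
    personal = {"token": token, "name": event["name"]}
    business = {"token": token, "amount": event["amount"],
                "merchant": event["merchant"]}
    return personal, business


personal, business = collect({
    "customer_id": "C-1001", "name": "Jane Doe",
    "amount": 42.0, "merchant": "Bookshop",
})
```

The business record can then flow into analytics without carrying the name or the original identifier, while the personal record is stored separately under stricter access control.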

[Figure 12.1: diagram mapping each technical requirement (data acquisition, data quality, data extraction, data integration/sharing, decision support, and data privacy and security) to its technologies and research questions]

Fig. 12.1 Mapping requirements to research questions in the finance and insurance sectors 


12.7.2 Data Quality 


• Manual processing and validation technology. 


— Research Question: Scalable data curation and validation. 
— Research Question: New methods to improve precision and reliability. 


12.7.3 Data Extraction 


• Language modelling technology. 


— Research Question: Obtaining keywords and key-phrases by using statistical 
language models. 

Table 12.2 Timeframe of the major expected outcomes of the big data roadmap for the finance 
and insurance sector 

Technical requirement      Expected outcomes, in order from Year 1 to Year 5
Data acquisition           Social APIs | Data stream management; Privacy and anonymization at collection time
Data quality               Scalable data curation and validation | New methods to improve precision and reliability
Data extraction            Statistical language models | New machine learning techniques to satisfy the newly required inference functionality | Processing of large datasets
Data integration/sharing   User-specific integration; Data variety: sentiments, quantitative information; Scaling methods for large data volumes and near-real-time processing
Decision support           Stream-based data mining | Machine learning adaptation to evolving content; Improved storage, computation, and communication capabilities
Data privacy and security  Apply external encryption and authentication controls | Privacy by design; Security by design | Data security for public-private hybrid environments; Enhanced compliance management


• Machine Learning technology. 


— Research Question: The size of datasets in financial services makes it neces- 
sary for new machine learning techniques to satisfy the newly required 
inference functionality. 


• Scalability in real-time technology: Real-time information is of interest in some 
application scenarios of financial services. 

— Research Question: The challenge of processing large datasets represents a 
requirement for research in the scalability of data processing in real-time as 
datasets grow in size and number. 


12.7.4 Data Integration/Sharing 


• Wrappers/mediators to encapsulate distributed data and automatic data and 
schema mapping technology: Sources of data in the financial services industry 
can be distributed across organizations, or across time and space. 


— Research Question: User-specific integration. Integration of data for the 
benefit of specific users (namely, business processes, or target end user 
organizations). 

— Research Question: Data variety: sentiments, quantitative information. 

— Research Question: Scaling methods for large data volumes and near-real- 
time processing. This research challenge is in relation to the “scalability in 
real time” described earlier, under “data extraction”. 


12.7.5 Decision Support 


• Multi-attribute decision-models technology: The availability of information 
from multiple sources will provide multiple attribute types that become available 
to include in decision-models. 


— Research Question: Stream-based data mining. 
— Research Question: Machine learning adaptation to evolving content. 


• Resource allocation in mining data streams technology: Elastic computing today 
allows for dynamic resource allocation as required. Improvements may be 
required in resource allocation for near real-time support to decision-making. 


— Research Question: Improved storage, computation, and communication 
capabilities. 
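To make the idea of a multi-attribute decision model concrete, the sketch below ranks instruments with a simple weighted sum over attributes drawn from different sources. The attribute names, weights, and values are hypothetical, the attributes are assumed to be pre-scaled to the range 0 to 1, and a weighted sum is only one of many possible model forms.

```python
def weighted_score(attributes, weights):
    """Weighted sum over attributes that are already scaled to [0, 1]."""
    return sum(weights[name] * value for name, value in attributes.items())


# Hypothetical weights expressing the relative importance of each attribute.
WEIGHTS = {"sentiment": 0.5, "momentum": 0.3, "liquidity": 0.2}

# Hypothetical attribute values, e.g. sentiment extracted from text and
# momentum and liquidity derived from structured market data.
candidates = {
    "ABC": {"sentiment": 0.8, "momentum": 0.5, "liquidity": 0.9},
    "XYZ": {"sentiment": 0.4, "momentum": 0.9, "liquidity": 0.6},
}

ranking = sorted(
    candidates,
    key=lambda symbol: weighted_score(candidates[symbol], WEIGHTS),
    reverse=True,
)
```

With these numbers "ABC" scores about 0.73 and "XYZ" about 0.59, so "ABC" ranks first. Adapting the weights as content evolves is the kind of question raised under machine learning adaptation above.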


12.7.6 Data Privacy and Security 


• Roles-based identity management and access control technology: Access control 
in the context of large datasets will pose a problem when sensitive data (business 
process related) begins to be exploited in large datasets and integrated with other 
data, and accessed by third parties. 

— Research Question: Privacy by design | Security by design. Advances in
“privacy by design” to link analytics needs with protective 
controls in processing and storage. 

— Research Question: Data Security for public-private hybrid environments. 
The advent of cloud storage and computation services, however, comes at the 
expense of data security and user privacy. 

— Research Question: Enhanced Compliance management (data protection, 
others). Research has already been initiated, but needs to continue in
providing methodologies and infrastructures that facilitate the monitoring,
enforcement, and audit of quantifiable indicators on the security of a business 
process. 

• Database encryption technology: The security concept of NoSQL databases 
generally relies on external enforcing mechanisms. 

— Research Question: Review the security architecture and policies of the 
overall system and apply external encryption and authentication controls to 
safeguard NoSQL databases. Data security for public-private hybrid 
environments. 


12.8 Conclusion and Recommendations for the Finance 
and Insurance Sectors 


The finance and insurance sector analysis for the roadmap is based on four major 
application scenarios that exploit banks’ and insurance companies’ own data 
to create new business value. The findings of this analysis show that there are still 
research challenges to overcome in developing the technologies to their full potential in order to 
provide competitive and effective solutions. These challenges appear at all levels of 
the big data value chain and involve a wide set of different technologies, which 
makes it necessary to prioritize investments in R&D. In broad terms 
there seems to be a general agreement on real-time aspects, better data quality 
techniques, scalability of data management and processing, better sentiment clas- 
sification methods, and compliance with security requirements along the supply 
chain. However, it is worth mentioning the importance of the application scenario 
and the real needs of the end user in order to determine these priorities. At the same 
time, apart from the technological aspects, there are organizational, cultural, and 
legal factors that will play a key role in how the financial services market takes on 
big data for its operations and business development. 
Chapter 13 
Big Data in the Energy and Transport Sectors 


Sebnem Rusitschka and Edward Curry 


13.1 Introduction 


The energy and transport sectors are currently undergoing two main transform- 
ations: digitization and liberalization. Both transformations bring to the fore typical 
characteristics of big data scenarios: sensors, communication, computation, and 
control capabilities through increased digitization and automation of the infrastruc- 
ture for operational efficiency leading to high-volume, high-velocity data. In 
liberalized markets, big data potential is realizable within consumerization scenar- 
ios and when the variety of data across organizational boundaries is utilized. 

In both sectors, there is a connotation that the term “big data” is not sufficient: 
the increasing computational resources embedded in the infrastructures can also be 
utilized to analyse data to deliver “smart data”. The stakes are high, since the 
multimodal optimization opportunities are within critical infrastructures such as 
power systems and air travel, where human lives could be endangered, not just 
revenue streams. 

In order to identify the industrial needs and requirements for big data techno- 
logies, an analysis was performed of the available data sources in energy and 
transport as well as their use cases in the different categories for big data value: 
operational efficiency, customer experience, and new business models. The energy 
and transport sectors are quite similar when it comes to the prime characteristics 
regarding big data needs and requirements as well as future trends. A special area is 
the urban setting where all the complexity and optimization potentials of the energy 
and transport sectors are focused within a concentrated regional area. 

The main need of the sectors is a virtual representation of the underlying 
physical system by means of sensors, smart devices, or so-called intelligent elec- 
tronic devices as well as the processing and analytics of the data from these devices. 
A mere deployment of existing big data technologies as used by the big data natives 
will not be sufficient. Domain-specific big data technologies are necessary in the 
cyber-physical systems for energy and transport. Privacy and confidentiality pre- 
serving data management and analysis is a primary concern of all energy and 
transport stakeholders that are dealing with customer data. Without satisfying the 
need for privacy and confidentiality, there will always be regulatory uncertainty and 
barriers to customer acceptance of new data-driven offerings. 


13.2 Big Data in the Energy and Transport Sectors 


The following section examines the dimensions of big data in energy and transport 
to identify the needs of business and end users with respect to big data technologies 
and their usage. 


Electricity Industry Data is coming from digitalized generator substations, trans- 
former substations, and local distribution substations in an electric grid infrastruc- 
ture of which the ownership has been unbundled. Information can come in the form 
of service and maintenance reports from field crews about regular and unexpected 
repairs, health sensor data from self-monitoring assets, data on end usage and 
power feed-in from smart meters, and high-resolution real-time data from 
GPS-synchronized phasor measurement units or intelligent protection and relay 
devices. An example use case comes from Electricité de France (EDF) (Picard 
2013), where they currently “do a standard meter read once a month. With smart 
meters, utilities have to process data at 15-min intervals. This is about a 3000-fold 
increase in daily data processing for a utility, and it’s just the first wave of the data 
deluge. Data comes from individual load curves, weather data, contractual infor- 
mation; network data 1 measure every 10 min for a target of 35 million customers. 
The estimated annual data volume would be 1800 billion records or 120 TB of raw 
data. The second wave will include granular data from smart appliances, electric 
vehicles, and other metering points throughout the grid. That will exponentially 
increase the amount of data being generated.” 
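The figures in the EDF quotation can be cross-checked with simple arithmetic. The sketch below is one plausible reading of the quote, not EDF's own calculation: it assumes a 30-day month for the meter-read comparison, 365 days per year, and decimal terabytes (10^12 bytes).

```python
# Meter reads: one manual read per month versus one read every 15 minutes.
reads_per_day_smart = 24 * 60 // 15               # 96 reads per day
reads_per_month_smart = reads_per_day_smart * 30  # assumes a 30-day month
fold_increase = reads_per_month_smart / 1         # versus 1 read per month

# Network data: one measure every 10 minutes for 35 million customers.
measures_per_customer_per_year = (60 // 10) * 24 * 365
annual_records = measures_per_customer_per_year * 35_000_000

# Implied average record size if those records occupy 120 TB of raw data.
bytes_per_record = 120e12 / annual_records
```

Under these assumptions the read frequency rises by a factor of 2880, which is in line with the quoted figure of about 3000. The annual record count comes to roughly 1.84 trillion, close to the quoted 1800 billion, and the implied raw record size is about 65 bytes.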


Oil and Gas Industry Data comes from digitalized storage and distribution 
stations, but wells, refineries, and filling stations are also becoming data sources 
in the intelligent infrastructure of an integrated oil and gas company. Down hole 
sensors from production sites deliver data on a real-time basis, including data from pressure, 
temperature, and vibration gauges, flow meters, acoustic and electromagnetic sensors, and
circulation solids. Other data comes from sources such as vendors, tracking service 
crews, measurements of truck traffic, equipment and hydraulic fracturing, water 
usage; Supervisory Control and Data Acquisition (SCADA) data from valve and 
pump events, asset operating parameters, out of condition alarms; unstructured 
reserves data, geospatial data, safety incident notes, and surveillance video streams. 
An example use case comes from Shell (Mearian 2012) where “optical fiber 
attached to down hole sensors generate massive amounts of data that is stored at 
a private isolated section of the Amazon Web Services. They have collected 
46 petabytes of data and the first test they did in one oil well resulted in 1 petabyte 
of information. Knowing that they want to deploy those sensors to approximately 
10,000 oil wells, we are talking about 10 Exabytes of data, or 10 days of all data 
being created on the Internet. Because of these huge datasets, Shell started piloting 
with Hadoop in the Amazon Virtual Private Cloud”. Other examples in the 
industry include (Nicholson 2012): “Chevron proof-of-concept using Hadoop for 
seismic data processing; Cloudera Seismic Hadoop project combining Seismic 
Unix with Apache Hadoop; PointCross Seismic Data Server and Drilling Data 
Server using Hadoop and NoSQL”. 


Transportation: In transportation the number of data sources is increasing rapidly.
Air and seaports, train and bus stations, logistics hubs, and warehouses are
increasingly employing sensors. Sources include electronic on-board recorders
(EOBRs) in trucks delivering data on load/unload times, travel times, driver hours,
and truck driver logs; pallet or trailer tags delivering data on transit and dwell
times; information on port strikes; public transport timetables, fare systems, and
smart cards; rider surveys; GPS updates from vehicle fleets; and higher volumes of
more traditional data from established sources such as frequent flyer programs.
An example use case comes from the
City of Dublin (Tabbitt 2014) where the “road and traffic department is now able to 
combine big data streaming from an array of sources—bus timetables, inductive- 
loop traffic detectors, closed-circuit television cameras, and GPS updates that each 
of the city’s 1000 buses transmits every 20 s—to build a digital map of the city 
overlaid with the real-time positions of Dublin’s buses using stream computing and 
geospatial data. Some interventions have led to a 10-15 % reduction in journey 
times”. 
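
The streaming load behind the Dublin example follows directly from the quoted
figures. The short calculation below is a sketch that assumes all 1000 buses
report around the clock, which the source does not state.

```python
# Message rate implied by the Dublin figures quoted above.
# Assumption (not from the source): all buses report 24 hours a day.
buses = 1000
update_interval_s = 20

updates_per_second = buses / update_interval_s
updates_per_day = buses * (24 * 60 * 60) // update_interval_s

print(f"{updates_per_second:.0f} GPS updates per second")
print(f"{updates_per_day:,} GPS updates per day")
```

The GPS feed alone is modest in volume; the challenge described in the use case
lies in combining it with the other sources in real time.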


13.3 Analysis of Industrial Needs in the Energy and Transport Sectors


Business needs can be derived from the previous dimensioning of big data and 
examples from within the energy and transport sectors: 


Ease of use regarding the typical big data technologies will ultimately ensure 
wide-scale adoption. Big data technologies employ new paradigms and mostly 
offer programmatic access. Users require software development skills and a deep 
understanding of the distributed computing paradigm as well as knowledge of the 
application of data analytics algorithms within such distributed environments. This 
is beyond the skillset of most business users.


Semantics of correlations and anomalies that can be discovered and visualized via
big data analytics need to be made accessible. Currently only domain and data 
experts together can interpret the data outliers; business users are often left with 
guesswork when looking at the results of data analytics. 


Veracity of data needs to be guaranteed before it is used in energy and transport
applications. Because the volume of data used in these applications will grow by
orders of magnitude, simple rules or manual plausibility checks are no longer
applicable.
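
One way to replace a manual plausibility check is a robust statistical screen
that runs automatically on every batch of readings. The sketch below uses the
modified z-score based on the median absolute deviation (MAD); the function name
`mad_outliers`, the sample voltages, and the conventional constants 0.6745 and
3.5 are choices made for illustration, not a method prescribed by this chapter.

```python
from statistics import median

def mad_outliers(readings, threshold=3.5):
    """Return indices of readings whose modified z-score exceeds the threshold."""
    med = median(readings)
    abs_dev = [abs(x - med) for x in readings]
    mad = median(abs_dev)
    if mad == 0:
        return []  # no spread in the data, so no reading can be scored
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]

# Hypothetical substation voltage readings (volts) with one dropout.
voltages = [229.8, 230.1, 230.4, 229.9, 230.2, 0.0, 230.0, 230.3]
suspect = mad_outliers(voltages)
print(suspect)
```

Because the median is insensitive to a few extreme values, the dropout at
index 5 is flagged without distorting the reference level, which a mean-based
rule would not guarantee.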


Smart data is a term often used by industrial stakeholders to emphasize that an
industrial business user needs refined data, not necessarily all raw data (big data),
but without losing information by concentrating only on small data that is of
relevance today. In cyber-physical systems, as opposed to online businesses,
information and communication technology (ICT) is embedded in the entire system
instead of only in the enterprise IT backend. Infrastructure operators have the
opportunity to pre-process data in the field, aggregate data, and distribute the
intelligence for data analytics along the entire ICT infrastructure to make the
best use of computing and communication resources to deal with volume and velocity
of mass sensor data.
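
Pre-processing in the field can be as simple as summarizing raw samples per
time window before they are transmitted. The following sketch illustrates the
idea; the function `aggregate`, the 60-second window, and the sample values are
hypothetical and only serve as an illustration.

```python
from statistics import mean

def aggregate(samples, window_s):
    """Group (timestamp_s, value) samples into fixed windows and summarize each."""
    buckets = {}
    for ts, value in samples:
        window_start = int(ts // window_s) * window_s
        buckets.setdefault(window_start, []).append(value)
    return [
        {
            "window_start": start,
            "count": len(vals),
            "min": min(vals),
            "max": max(vals),
            "mean": mean(vals),
        }
        for start, vals in sorted(buckets.items())
    ]

# Hypothetical raw samples: (seconds since start, measured value)
raw = [(0, 50.0), (1, 50.5), (2, 49.5), (60, 51.0), (61, 53.0)]
summary = aggregate(raw, window_s=60)
for row in summary:
    print(row)
```

Five raw samples become two summary records here. Keeping minimum and maximum
next to the mean is one way to reduce volume without hiding short excursions.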


Decision support and automation become a core need as the pace and structure
of business changes. European grid operators today need to intervene almost daily
to prevent potentially large-scale blackouts, e.g. due to integration of renewables or
liberalized markets. Traffic management systems become more and more elaborate
as the number of digitized and controllable elements increases. Business users need
more information than “something is wrong”. Visualizations can be extremely 
useful, but the question of what needs to be done remains to be answered either 
in real-time or in advance of an event, i.e. in a predictive manner. 


Scalable advanced analytics will push the envelope on the state of the art. For 
example, smart metering data analytics (Picard 2013) include segmentation based 
on load curves, forecasting on local areas, scoring for non-technical losses, pattern 
recognition within load curves, predictive modelling, and real-time analytics in a 
fast and reliable manner in order to control delicate and complex systems such as 
the electricity grid (Heyde et al. 2010). In the US transportation sector, the business 
value of scalable real-time analytics is already being reaped by using big data 
systems for full-scale automation applications, e.g. automated rescheduling that 
helps trains to dynamically adapt to events and be on time across a wide area.
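
Segmentation based on load curves, the first analytics task named above, can be
illustrated with a plain k-means clustering. The sketch below is a toy version
in pure Python; the four-value daily profiles, the naive initialization with
the first k curves, and the function name `kmeans` are assumptions for
illustration, and a production system would rely on a scalable library instead.

```python
def kmeans(curves, k, iterations=20):
    """Plain k-means on equal-length load curves (lists of kW values)."""
    centroids = [list(c) for c in curves[:k]]  # naive init: first k curves
    labels = [0] * len(curves)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, curve in enumerate(curves):
            distances = [
                sum((a - b) ** 2 for a, b in zip(curve, c)) for c in centroids
            ]
            labels[i] = distances.index(min(distances))
        # Update step: each centroid becomes the mean of its member curves.
        for j in range(k):
            members = [c for c, lab in zip(curves, labels) if lab == j]
            if members:
                centroids[j] = [sum(v) / len(members) for v in zip(*members)]
    return labels, centroids

# Hypothetical daily profiles in kW: night, morning, midday, evening
load_curves = [
    [0.2, 0.8, 0.3, 1.5],  # evening peak
    [0.3, 0.4, 1.6, 0.5],  # midday peak
    [0.2, 0.9, 0.4, 1.4],  # evening peak
    [0.4, 0.5, 1.5, 0.4],  # midday peak
]
labels, centroids = kmeans(load_curves, k=2)
print(labels)
```

The two resulting segments separate evening-peak from midday-peak customers;
with real smart meter data each curve would hold 48 or 96 values per day.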

Big data analytics offer many improvements for the end users. Operational 
efficiency ultimately means energy and resource efficiency and timeliness, which 
will improve quality of life—especially in urban mobility settings. 


Customer experience and new business models related to big data scenarios are 
entirely based on better serving the end user of energy and mobility. However, both
scenarios need personalized data in higher resolution. There is significant value in
cross-combining a variety of data, which on the downside can make
pseudonymization or even anonymization ineffective in protecting the identity and
behavioural patterns of individuals, or the business patterns and the strategies of
companies.
New business models based on monetizing the collected data, with currently 
unclear regulations, leave end users entirely uninformed, and unprotected against 
secondary use of their data for purposes they might not agree with, e.g. insurance 
classification, credit rating, etc. 


Reverse transparency is at the top of the wish list of data-literate end users. Data 
analytics need to empower end users to grasp the usage of their data trails. The 
access and usage of an end user’s data should become efficiently and dynamically
configurable by the end users. End users need practical access to information on 
what data is used by whom, and for what purpose in an easy-to-use, manageable 
way. Rules and regulations are needed for granting such transparency for end users. 
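
What such configurable transparency could look like on the data level is
sketched below: a ledger records every use of an end user's data, refuses
purposes the user has blocked, and reports who used what and why. All class,
method, and field names are hypothetical; the sketch illustrates the need
described above and is not an existing system or standard.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class UsageEvent:
    data_item: str  # what data is used
    used_by: str    # by whom
    purpose: str    # for what purpose

class UsageLedger:
    def __init__(self):
        self._events = defaultdict(list)  # end user -> recorded uses
        self._blocked = defaultdict(set)  # end user -> blocked purposes

    def block(self, user, purpose):
        """Let the end user switch off a purpose at any time."""
        self._blocked[user].add(purpose)

    def request(self, user, event):
        """Record and allow the use unless the end user blocked its purpose."""
        if event.purpose in self._blocked[user]:
            return False
        self._events[user].append(event)
        return True

    def report(self, user):
        """Answer: what data was used by whom, and for what purpose?"""
        return list(self._events[user])

ledger = UsageLedger()
ledger.block("alice", "credit rating")
allowed = ledger.request(
    "alice", UsageEvent("half-hourly consumption", "utility", "billing"))
denied = ledger.request(
    "alice", UsageEvent("half-hourly consumption", "data broker", "credit rating"))
print(allowed, denied, ledger.report("alice"))
```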


Data access, exchange, and sharing for both business and end users. In today’s 
complex electricity or intermodal mobility markets, there is almost no scenario 
where all the required data for answering a business, or engineering, question 
comes from one department’s databases. Nonetheless, most of the currently 
installed advanced metering infrastructures have a lock-in of the acquired energy 
usage data to the utilities’ billing systems. The lock-in makes it cumbersome to use 
the energy data for other valuable analytics. These data silos date back to the
time when most European infrastructure businesses were vertically integrated
companies. The amount of data to be exchanged was also much smaller, so
interfaces, protocols, and processes for data exchange remained rather rudimentary.


13.4 Potential Big Data Applications for the Energy and Transport Sectors


In the pursuit of collecting the many sectorial requirements towards a European big 
data economy and its technology roadmap, big data applications in energy and 
transport have been analysed. A finding that is congruent with Gartner’s study on 
the advancement of analytics (Kart 2013) is that big data applications can be 
categorized as “operational efficiency”, “customer experience”, and “new business 
models”. 

Operational efficiency is the main driver (Kart 2013) behind the investments for 
digitization and automation. The need for operational efficiency is manifold, such 
as increasing revenue margins, regulatory obligations, or coping with the loss of 
retiring skilled workers. Once pilots of big data technologies are set up to analyse 
the masses of data for operational efficiency purposes, the businesses realize that 
they are building a digital map of their businesses, products, and infrastructures— 
and that these maps combined with a variety of data sources can also deliver
additional insight in other areas of the business such as asset conditions, end usage
patterns, etc.

The remainder of this section details big data scenarios and the key challenges
that hinder the uptake of these scenarios in Europe.


13.4.1 Operational Efficiency 


Operational efficiency subsumes all use cases that involve improvements in main- 
tenance and operations in real time, or in a predictive manner, based on the data 
which comes from infrastructure, stations, assets, and consumers. Technology 
vendors who developed the sensorization of the infrastructure are the main 
enablers. The market demand for enhanced technologies is increasing, because it 
helps the businesses in the energy sector to better manage risk. The complexity of 
the pan-European interconnected electricity markets, with the integration of renew- 
ables and liberalization of electricity trading, requires more visibility of the under- 
lying system and of the energy flows in real time. As a rule of thumb, anything with 
the adjective “smart” falls into this category: smart grid, smart metering, smart 
cities, and smart (oil, gas) fields. Some examples of big data use cases in operational 
efficiency are as follows: 


• Predictive and real-time analysis of disturbances in power systems and
cost-effective countermeasures.

• Operational capacity planning, monitoring, and control systems for energy
supply and networks, dynamic pricing.

• Optimizing multimodal networks in energy as well as transportation especially
in urban settings, such as city logistics or eCar-sharing for which the energy
consumption and feed-in to the transportation hubs could be cross-optimized
with logistics.


All of the scenarios in this category share one main challenge, the connecting
of data silos: be it across departments within vertically integrated
companies, or across organizations along the electricity value chain. The big data
use cases in the operational efficiency scenario require seamless integration of data, 
communication, and analytics across a variety of data sources, which are owned by 
different stakeholders. 


13.4.2 Customer Experience 


Understanding big data opportunities regarding customer needs and wants is
especially interesting for companies in liberalized consumerized markets such as
electricity, where entry barriers for new players as well as the margins are
decreasing. Customer loyalty and continuous service improvement are what enable
energy players to grow in these markets.
Some examples of using big data to improve customer experience are as follows: 


• Continuous service improvement and product innovation, e.g. individualized
tariff offerings based on detailed customer segmentation using smart meter or
device-level consumption data.

• Predictive lifecycle management of assets, i.e. data from machines and devices
combined with enterprise resource planning and engineering data to offer
services such as intelligent on-demand spare-parts logistics.

• Industrial demand-side management, which allows for energy efficient
production and increases competitiveness of manufacturing businesses.


The core challenge is handling confidentiality and privacy of domestic and 
business customers while getting to know and anticipate their needs. The data 
originator, data owner, and data user are different stakeholders that need to collabo- 
rate and share data to realize these big data application scenarios. 


13.4.3 New Business Models 


New business models revolve around monetizing the available data sources and 
existing data services in new ways. There are quite a few cases in which data 
sources or analysis from one sector represents insights for stakeholders within 
another sector. An analysis of energy and mobility data start-ups shows that there 
is a whole new way of generating business value if the end user owns the resources. 
Then the business is entirely customer- and service-oriented; whereas the infra- 
structures of energy and transport with their existing stakeholders are utilized as 
part of the service. These are called intermediary business models. 

Energy consumer segment profiles from metering service providers, such as
prosumer profiles for power feed-in from photovoltaic or combined heat and power
units, or actively managed demand-side profiles, could also be offered to smaller
energy retailers, network operators, or utilities. These players can benefit from
improvements on the standard profiles of energy usage but do not yet have access
to high-resolution energy data of their own customers.

The core challenge is to provide clear regulation around the secondary use of 
energy and mobility data. The connected end user is the minimal prerequisite for 
these consumer-focused new business models. The new market segments are 
diversified through big data energy start-ups like Next Kraftwerke, who “merge
data from various sources such as operational data from our virtual power plant, 
current weather and grid data as well as live market data. This gives Next 
Kraftwerke an edge over conventional power traders” (Kraftwerke 2014). 

In transportation, cars are parked 95 % of the time (Barter 2013) and according
to a recent study, one car-sharing vehicle displaces 32 new vehicle purchases
(AlixPartners 2014). Businesses that previously revolved around the product now
become all about data-driven services. In contrast to the energy sector, this
bold move shows the readiness of the transportation incumbents to seize the big
data value potential of a data-driven business.


13.5 Drivers and Constraints for Big Data in Energy and Transport


13.5.1 Drivers 


The key drivers in the energy and transport sectors are as follows: 


• Efficiency increase of the energy and transportation infrastructure and
associated operations.

• Renewable energy sources have transformed whole national energy policies,
e.g. the German “Energiewende”. Renewable energy integration requires
optimization on multiple fronts (e.g., grid, market, and end usage or storage) and
increases the dependence of electrification on weather and weather forecasts.

• Digitization and automation can substantially increase efficiency in the
operation of flow networks such as in electricity, gas, water, or transport networks.
These infrastructure networks will become increasingly sensorized, which adds
considerably to the volume, velocity, and variety of industrial data.

• Communication and connectivity are needed to collect data for optimization and
control automation. There needs to be bidirectional and multidirectional
connectivity between field devices, e.g. intelligent electronic devices in an
electricity grid substation or traffic lights.

• Open data: Publication of operational data on transparency platforms² by grid
network operators, by the energy exchange market, and by the gas transmission
system operators is a regulatory obligation that fosters grass-roots projects.
Open Weather Map³ and Open Street Map⁴ are examples of user-generated
free of charge data provisioning which are very important for both sectors.

• The “skills shift”: As a result of the retirement of skilled workers, such as truck
drivers or electricity grid operators, a know-how shortage is being created that
needs to be filled fast. This directly translates to increasing prices for the
customers, because higher salaries need to be paid to attract the few remaining
skilled workers in the market. In the mid to long term, efficiency increases and more
automation will be the prevailing trends, such as driverless trucks⁶ in
transportation or wide area monitoring protection and control systems in energy.


2 www.entsoe.net, www.transparency.eex.com, http://www.gas-roads.eu/
3 http://openweathermap.org/
4 http://www.openstreetmap.org/


13.5.2 Constraints 


Constraints in the energy and transport sectors are as follows: 


• Skills: There are comparatively few people who can apply big data management
and analytics knowledge together with domain know-how within the sectors.

• Interpretation: Implicit or tacit models are in the heads of the (retiring) skilled
workers. Scalable domain model extraction becomes key, e.g. in traffic
management systems rule bases grow over years to unmanageable complexities.

• Digitization has not yet reached the tipping point: Digitization and
automation of infrastructure require upfront investments, which are not well
considered, if at all, by the incentive regulation by which infrastructure operators
are bound. Real-time higher-resolution data is still not widely available.

• Uncertainty regarding digital rights and data protection laws: Unclear views
on data ownership hold back big data in the end user facing segments of the
energy and transport sectors (e.g. smart metering infrastructure).

• “Digitally divided” European Union: Europe has fragmented jurisdiction when
it comes to digital rights.

• “Business-as-usual” trumps “data-driven business”: In established
businesses it is very hard to change running business value chains. Incumbents
will need to deal with a lot of changes: change in the existing long innovation
cycles, change to walled garden views of closed systems and silos, and a change
in the mind-set so that ICT becomes an enabler if not a core competency in their
companies.

• Missing end user acceptance: In the energy sector it is often argued that people
are not interested in energy usage data. However, when missing end user
acceptance of a technology is argued, it is more a statement that a useful service
using this technology is not yet deployed.

• Missing trust: Trust is an issue that could and should be remedied with
technical data protection and with a regulatory framework (i.e., appropriate
privacy protection laws).


6 http://www.techhive.com/article/2046262/the-first-driverless-cars-will-actually-be-a-bunch-of-trucks.html


13.6 Available Energy and Transport Data Resources


The more the potential for big data was explored within the two sectors, the clearer
it became that the list of available data sources will keep growing and can never be
exhaustive. A key observation is that the variety of data sources utilized to find an
answer to a business or engineering question is the differentiator from
business-as-usual.


• Infrastructure data includes power transmission and distribution lines, and
pipelines for oil, gas, or water. In transportation, infrastructure consists of
motorways, railways, air and seaways. The driving question is capacity. Is a
road congested? Is a power line overloaded?

• Stations are considered part of the infrastructure. In business and engineering
questions they play a special role as they include the main assets of an
infrastructure in a condensed area, and are of high economic value. The main
driving question is current status and utilization levels, i.e. the effective capacity
of the infrastructure. Is a transmission line open or closed? Is it closed due to a
fault on the line? Is a subway delayed? Is it due to a technical difficulty?

• Time-stamped and geo-tagged data are required and increasingly available,
especially GPS-synchronized data in both sectors, but also GSM data for tracing
mobility and extracting mobility patterns.

• Weather data, besides geo-location data, is the most used data source in both
sectors. Most energy consumption is caused by heating and cooling, which are
highly weather-dependent consumption patterns. With renewable energy
resources, power feed-in into the electrical grid becomes weather dependent.

• Usage data and patterns, indicators, and derived values of end usage of the
respective resource and infrastructure, in both energy and transport, can be
harvested by many means, e.g. within the smart infrastructure, via metering at
stations at the edges of the network, or smart devices.

• Behavioural patterns both affect energy usage and mobility patterns and can
be predicted. Ethical and social aspects become a major concern and stumbling
block. The positive effects such as better consumer experience, energy
efficiency, more transparency, and fair pricing must be weighed against the
negative side effects.

• Data sources in the horizontal IT landscape, including data coming from
sources such as CRM tools, accounting software, and historical data coming
from ordinary business systems. The value potential from cross-combining
historical data with new sources of data which come from the increased
digitization and automation in energy and transportation systems is high.

• Finally, a myriad of external third-party data or open data sources are
important for big data scenarios in energy and transport sectors, including
macro-economic data, environmental data (meteorological services, global
weather models/simulation), market data (trading info, spot and forward,
business news), human activity (web, phone, etc.), energy storage information,
geographic data, predictions based on Facebook and Twitter, and information
communities such as Open Energy Information.⁷


13.7 Energy and Transport Sector Requirements 


The analysed business user and end user needs, as well as the different types of data 
sharing needs directly translate into technical and non-technical requirements. 


13.7.1 Non-technical Requirements 


Several non-technical requirements in the sectors were identified: 


• Investment in communication and connectedness: Broadband communication,
or ICT in general, needs to be widely available across all of Europe
and alongside energy and transportation infrastructure for real-time data access.
Connectedness needs to extend to end users to allow them to be continuously
connected.

• A digitally united European Union: Roaming costs have been preventing
European end users from using data-intensive apps across national borders.
European data-driven service providers, especially start-ups looking for
scalability of their business models, have mainly focused on the US market, and
not the 27 other EU member states, due to different data-related regulations.
European stakeholders require reliable, minimally consistent rules and
regulations regarding digital rights. A digital bill of rights⁸ as called
for by the inventor of the Web, Tim Berners-Lee, is globally the right move and
should be supported by Europe.

• A better breeding ground for start-ups and start-up culture is required,
especially for techno-economic paradigm shifts like big data and the spreading
digitization, where new business widely deviates from business-as-usual.
Energy and mobility start-ups require more than just financial investments; they
also need freedom for exploration and experimentation with data. Without this
freedom innovation has little chance, unless techniques for privacy preserving
analytics prove feasible.

• Open data in this regard is a great opportunity; however, standardization is
required. Practical migration paths are required to simplify the adoption of
state-of-the-art standards. Data model and representation standards will enable the
growth of a data ecosystem with collaborative data mining, shareable granularity
of data, and accompanying techniques that prevent de-anonymization.

• Data skilled people: Programming, statistics, and associated tools need to be a
part of engineering education. Traditional data analysts need to grasp the
distributed computing paradigm, e.g. how to design algorithms that run on
massively parallel systems, how to move algorithms to data, or how to engineer
entirely new breeds of algorithms.


7 http://en.openei.org
8 http://www.wired.com/2014/03/web25


13.7.2 Technical Requirements 


Several technical requirements were identified in the sectors: 


• Abstraction from the actual big data infrastructure is required to enable (a) ease
of use, and (b) extensibility and flexibility. The analysed use cases have such
diverse requirements that there is no single big data platform or solution that will
empower the future utility businesses.

• Adaptive data and system models are needed so that new knowledge extracted
from domain and system analytics can be redeployed into the data analytics
framework without disrupting daily business. The abstraction layer should
accommodate plug-in adaptive models.

• Data interpretability must be assured without the constant involvement of
domain experts. The results must be traceable and explainable. Expert and
domain know-how must be blended into data management and analytics.

• Data analytics is required as part of every step from data acquisition to data
usage. In data acquisition, embedded in-field analytics can enhance the veracity
of data and can support different privacy and confidentiality settings on the same
data source for different data users, e.g. service providers.

• Real-time analytics is required to support decisions, which need to be made in
ever-shorter time spans. In smart grid settings, near real-time dynamic control
requires insights at the source of the data.

• Data lake is required in terms of low-cost off-the-shelf storage technology
combined with the ability to efficiently deploy data models on demand
(“schema-on-read”), instead of the typical data warehouse solution of
extract-transform-load (ETL).

• Data marketplaces, open data, data logistics, standard protocols capable of
handling the variety, volume, and velocity of data, as well as data platforms are
required for data sharing and data exchange across organizational boundaries.
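
The “schema-on-read” idea named in the data lake requirement can be illustrated
in a few lines: raw records are stored exactly as they arrive, and a schema is
applied only when a particular question is asked. The record fields, the schema
`billing_schema`, and the helper `read_with_schema` below are hypothetical and
only sketch the principle.

```python
import json

# Raw events are kept as they arrive (here: JSON lines held in memory).
raw_lines = [
    '{"meter": "m1", "ts": "2014-03-01T00:00:00", "kwh": "0.42", "fw": "1.2"}',
    '{"meter": "m2", "ts": "2014-03-01T00:00:00", "kwh": "0.37"}',
]

# A schema maps field names to converters and is chosen at query time.
billing_schema = {"meter": str, "kwh": float}

def read_with_schema(lines, schema):
    """Parse raw lines and keep only the fields the schema asks for."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field])
            for field, cast in schema.items()
            if field in record
        }

rows = list(read_with_schema(raw_lines, billing_schema))
print(rows)
```

A different analysis could read the same raw lines with another schema, for
example one that keeps the timestamp, without any reload of the stored data. An
ETL pipeline would instead fix the target structure before storage.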


13.8 Technology Roadmap for the Energy and Transport Sectors


The big data value chain for infrastructure- and resource-centric systems of energy 
and transport businesses consists of three main phases: data acquisition, data 
management, and data usage. Data analytics, as indicated by business user needs, 
is implicitly required within all steps and is not a separate phase. 

The technology roadmap for fulfilling the key requirements along the data value 
chain for the energy and transport sectors focuses on technology that is not readily 
available and needs further research and development in order to fulfil the
stricter requirements of energy and transport applications (Fig. 13.1).


13.8.1 Data Access and Sharing 


Energy and transport are resource-centric infrastructure businesses. Access to usage 
data creates the opportunity to analyse the usage of a product or service to improve 
it, or gain efficiency in sales and operations. Usage data needs to be combined with 
other available data to deliver reliable predictive models. Currently there is a
trade-off between enhancing interpretability of data and preserving privacy and
confidentiality. The following example of mobility usage data combined with a variety
of other data demonstrates the privacy challenge. de Montjoye et al. (2013) show
that “4 spatio-temporal points (approximate places and times) are enough to 
uniquely identify 95 % of 1.5 M people in a mobility database. The study further 
states that these constraints hold even when the resolution of the dataset is low”. 
The work shows that mobility datasets combined with metadata can circumvent 
anonymity. 
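
The notion of uniqueness behind this result can be made concrete on a toy
dataset. The sketch below computes, for a given number of known points p, the
average share of p-point subsets of a trace that match exactly one person. The
three traces and the function `unicity` are invented for illustration; the
cited study works on real data with 1.5 M people, and this exhaustive
enumeration is not claimed to be the authors' procedure.

```python
from itertools import combinations

def unicity(traces, p):
    """Average share of p-point subsets that match exactly one user's trace."""
    per_user = []
    for points in traces.values():
        subsets = list(combinations(sorted(points), p))
        unique = sum(
            1
            for s in subsets
            if sum(1 for other in traces.values() if set(s) <= other) == 1
        )
        per_user.append(unique / len(subsets))
    return sum(per_user) / len(per_user)

# Hypothetical traces: sets of (place, hour) points per pseudonymized user
traces = {
    "u1": {("A", 8), ("B", 9), ("C", 18)},
    "u2": {("A", 8), ("B", 9), ("D", 18)},
    "u3": {("A", 8), ("E", 12), ("C", 18)},
}

for p in (1, 2, 3):
    print(p, round(unicity(traces, p), 3))
```

Even in this tiny example the share of identifying subsets rises quickly with
every additional known point, although no names appear anywhere in the data.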

At the same time, insufficient privacy protection options can hinder the sourcing 
of big data in the first place, as experiences from smart metering rollouts in the 
energy businesses show. In the EU only 10 % of homes have smart meters (Nunez 
2012). Although there is a mandate that the technology reaches 80 % of homes by 
2020, European rollouts are stagnant. A survey from 2012 (Department of Energy 
and Climate Change 2012) finds that “with increasing reading frequency, i.e. from 
monthly to daily, to half hourly, etc., energy consumption data did start to feel more 
sensitive as the level of detail started to seem intrusive... Equally, it was not clear 
to some [participants] why anyone would want the higher level of detail, leaving a 
gap to be filled by speculation which resulted in some [participants] becoming more 
uneasy”. 
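
The growth in detail that the survey participants reacted to can be quantified
per meter and year. The calculation below is a simple sketch assuming a
non-leap year of 365 days.

```python
# Readings per meter and year for the frequencies named in the survey.
# Assumption (not from the source): a non-leap year of 365 days.
DAYS_PER_YEAR = 365

readings_per_year = {
    "monthly": 12,
    "daily": DAYS_PER_YEAR,
    "half-hourly": DAYS_PER_YEAR * 24 * 2,
}

for frequency, count in readings_per_year.items():
    factor = count / readings_per_year["monthly"]
    print(f"{frequency:>11}: {count:>6} readings/year ({factor:,.0f}x monthly)")
```

Half-hourly metering produces 17,520 readings per meter and year, more than a
thousand times the monthly figure, which helps explain why it feels more
intrusive.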

Advances are needed for the following technologies for data access and sharing: 


• Linked data is a lightweight practice for exposing and connecting pieces of
data, information, or knowledge using basic web standards. It promises to open
up siloed data ownership and is already an enabler of open data and data sharing.
However, with the increasing number of data sources already linked, the various
types of new data that will come from intelligent infrastructures, and always
connected end users in energy and mobility, scalability and cost-efficacy
become an issue. One of the open research questions is how to (semi-)
automatically extract data linkage to increase current scalability.

• Encrypted data storage can enable integrated, data-level security. As cloud
storage becomes commonplace for domestic and commercial end users, better
and more user-friendly data protection becomes a differentiation factor (Tanner
2014). In order to preserve privacy and confidentiality the use of encrypted data
storage will be a basic enabler of data sharing and shared analytics. However,
analytics on encrypted data is still an ongoing research question. The most
widely pursued research is called fully homomorphic encryption. Homomorphic
encryption theoretically allows operations to be carried out on the cipher text.
The result is a cipher text that, when decrypted, matches the result of the operation
on plaintext. Currently only basic operations are feasible.

• Data provenance is the art of tracking data through all transformations,
analyses, and interpretations. Provenance assures that data that is used to create
actionable insights is reliable. The metadata that is generated to realize
provenance across the big variety of datasets from differing sources also increases
interpretability of data, which in turn could improve automated information
extraction. However, scaling data provenance across the dimensions of big
data is an open research question.

• Differential privacy (Dwork and Roth 2014) is a mathematically rigorous
definition of privacy (and its loss) with the accompanying algorithms. The
fundamental law of information recovery (Dwork and Roth 2014) states that
too many queries with too few errors will expose the real information. The
purpose of developing better algorithms is to push this event as far away as
possible. This notion is very similar to the now mainstream realization that there
is no unbreakable security, but that barriers if broken need to be fixed and
improved. The cutting-edge research on differential privacy considers
distributed databases and computations on data streams, enabling linear scalability
and real-time processing for privacy preserving analytics. Hence, this technique
could be an enabler of privacy preserving analytics on big data, allowing big data
to gain user acceptance in mobility and energy.
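
To make the last item more tangible, the sketch below shows the Laplace
mechanism, a standard building block of differential privacy, applied to a
counting query. The household data, the function names, and the choice of
epsilon are assumptions for illustration; the code is a teaching sketch and is
not hardened for production use.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(values, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical daily consumption per household in kWh
daily_kwh = [3.1, 12.4, 8.0, 15.2, 9.9, 20.3]

rng = random.Random(42)
noisy = private_count(daily_kwh, lambda v: v > 10, epsilon=1.0, rng=rng)
print(f"Noisy number of households above 10 kWh: {noisy:.2f}")
```

Each released answer consumes part of a privacy budget: repeating the query and
averaging the answers would recover the true count, which is the effect the
fundamental law of information recovery describes.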


13.8.2 Real-Time and Multi-dimensional Analytics 


Real-time and multi-dimensional analytics enable real-time, multi-way analysis of
streaming, spatiotemporal energy and transport data. Examples from dynamic
complex cyber-physical systems such as power networks show that there is a
clear business mandate. Global spending on power utility data analytics is forecast
to top $20 billion over the next 9 years, with an annual spend of $3.8 billion globally
by 2020 (GTM Research 2012). However, the cost-efficacy of the required
technologies needs to be proven. Real-time monitoring does not justify the cost if
actions cannot be undertaken in real time. Phasor measurement technology, enabling
high-resolution views of the current status of power networks in real time, is a
technology that was invented 30 years ago. Possible applications have been
researched for more than a decade. Initially there was no business need for it,
because the power systems of the day were well engineered and well structured,
hierarchical, static, and predictable. With increased dynamics through market
liberalization and the integration of power generation technology from intermittent
renewable sources like wind and solar, real-time views of power networks become
indispensable. Advances are needed for the following technologies:


• Distributed stream computing is currently gaining traction. There are two
different strains of research and development in stream computing: (1) stream
computing as in complex event processing (CEP), which has focused mainly
on analysing high-variety, high-velocity data, and (2) distributed stream
computing, which focuses on high-volume, high-velocity data processing.
The current research direction is to add the missing third dimension to each
strain: volume for CEP and variety for distributed stream computing. It is argued
that distributed stream computing, which already has linear scalability and
real-time processing capabilities, will tackle high-variety data challenges with
semantic techniques (Hasan and Curry 2014) and Linked data. A further open
question is how to ease the development and deployment of algorithms that make
use of distributed stream computing as well as other computing and storage
solutions, such as plain old data warehouses and RDBMS. Since cost-effectiveness
is the main enabler for big data value, advanced elasticity, with computing and
storage provided on demand as the algorithm requires, must also be tackled.

• Machine learning is a fundamental capability needed when dealing with big
data and dynamic systems, where a human could not possibly review all data, or 
where humans just lack the experience or ability to be able to define patterns. 
Systems are becoming increasingly dynamic, with complex network
effects. In these systems humans are not capable of extracting reliable cues in
real time—but only in hindsight during post-mortem data analysis (which can 
take significant time when performed by human data scientists). Deep learning, a 
research field that is gaining momentum, concentrates on more complex 
non-linear data models and multiple transformations of data. Some represent- 
ations of data are better for answering a specific question than others, meaning 
multiple representations of the same data in different dimensions may be 
necessary to satisfy an entire application. The open questions are: how to 
represent specific energy and mobility data, possibly in multiple dimensions— 
and how to design algorithms that learn the answers to specific questions of the 
energy and mobility domains better than human operators can—and do so in a 
verifiable manner. The main questions for machine learning are cost-effective 
storage and computing for massive amounts of high-sampled data, the design of 
new efficient data structures, and algorithms such as tensor modelling and 
convolutional neural networks. 
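
To make the stream computing item above more concrete, here is a minimal single-process sketch of a time-based sliding-window average, the kind of operator a stream engine applies to each partition of a high-velocity feed. The function name, the tuple layout, and the grid-frequency sample values are assumptions made for illustration; a distributed engine would additionally partition the stream by key and handle out-of-order events.

```python
from collections import deque


def sliding_average(stream, window_seconds):
    """Yield (timestamp, mean) over a time-based sliding window.

    `stream` is an iterable of (timestamp, value) pairs in timestamp order.
    """
    window = deque()
    total = 0.0
    for ts, value in stream:
        window.append((ts, value))
        total += value
        # Evict readings that have fallen out of the window.
        while window and window[0][0] <= ts - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield ts, total / len(window)


# Hypothetical grid-frequency readings: (seconds, Hz).
samples = [(0, 50.0), (1, 50.2), (2, 49.8), (3, 49.0), (4, 48.6)]
for ts, mean in sliding_average(samples, window_seconds=3):
    print(ts, round(mean, 3))
```

Because the operator keeps only the readings inside the window, its memory use depends on the window length and not on the total length of the stream.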


13.8.3 Prescriptive Analytics 


Prescriptive analytics enable real-time decision automation in energy and mobility 
systems. The more complex and dynamic the systems are becoming, the faster 
insights from data will need to be delivered to enhance decision-making. With 
increasing ICT installed into the intelligent infrastructures of energy and transport, 
decision automation becomes feasible. However, with the increasing digitization, 
the normal operating state, when all digitized field devices deliver actionable 
information on how to operate more efficiently, will overwhelm human operators. 
The only options are either to rely on dependable automated decision
algorithms or to ignore the insights that a human operator cannot reasonably
handle each second, at the cost of reduced operational efficiency.
Advances are needed for the following technologies: 


• Prescriptive analytics: Technologies enabling real-time analytics are the basis
for prescriptive analytics in cyber-physical systems with resource-centric
infrastructures such as energy and transport. With prescriptive analytics the simple
predictive model is enhanced with possible actions and their outcomes, as well
as an evaluation of these outcomes. In this manner, prescriptive analytics not
only explains what might happen, but also suggests an optimal set of actions.
Simulation and optimization are analytical tools that support prescriptive
analytics.

• Machine-readable engineering and system models: Currently many system
models are not machine-readable. Engineering models on the other hand are 
semi-structured because digital tools are increasingly used to engineer a system. 
Research and innovation in this area of work will assure that machine learning 
algorithms can leverage system know-how that today is mainly limited to 
humans. Linked data will facilitate the semantic coupling of know-how at design 
and implementation time, with discovered knowledge from data at operation 
time, resulting in self-improving data models and algorithms for machine learn- 
ing (Curry et al. 2013). 

• Edge computing: Intelligent infrastructures in the energy and mobility sectors
have ICT capability built-in, meaning there is storage and computing power 
along the entire cyber-physical infrastructure of electricity and transportation 
systems, not only in the control rooms and data centres at enterprise-level. 
Embedded analytics, and distributed data analytics, facilitating the in-network 
and in-field analytics (sometimes referred to as edge-computing) in conjunction 
with analytics carried out at enterprise-level, will be the innovation trigger in 
energy and transport. 
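
The prescriptive analytics idea described above (a predictive model extended with possible actions, their outcomes, and an evaluation of those outcomes) can be sketched in a few lines. Everything in this example is hypothetical: the linear load forecast, the load-shedding actions, and the cost weights are placeholders standing in for real simulation and optimization tools.

```python
def predict_load(current_mw, trend_mw):
    """Toy predictive model: linear extrapolation one step ahead."""
    return current_mw + trend_mw


def outcome_cost(action, predicted_mw, capacity_mw):
    """Evaluate the outcome of an action given the predicted load.

    Hypothetical cost model: shedding load is costly, overload far more so.
    """
    load_after = predicted_mw - action["shed_mw"]
    overload = max(0.0, load_after - capacity_mw)
    return action["shed_mw"] * 10.0 + overload * 100.0


def prescribe(actions, current_mw, trend_mw, capacity_mw):
    """Return the candidate action with the lowest evaluated cost."""
    predicted = predict_load(current_mw, trend_mw)
    return min(actions, key=lambda a: outcome_cost(a, predicted, capacity_mw))


ACTIONS = [
    {"name": "do nothing", "shed_mw": 0.0},
    {"name": "shed 5 MW", "shed_mw": 5.0},
    {"name": "shed 10 MW", "shed_mw": 10.0},
]

best = prescribe(ACTIONS, current_mw=96.0, trend_mw=7.0, capacity_mw=100.0)
print(best["name"])
```

The prediction alone only says that the load is likely to exceed capacity. The prescriptive step adds the enumeration of actions and the evaluation that ranks them, which is what allows the decision to be automated.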


13.8.4 Abstraction 


Abstraction from the underlying big data technologies is needed to enable ease of 
use for data scientists, and for business users. Many of the techniques required for 
real-time, prescriptive analytics, such as predictive modelling, optimization, and 
simulation, are data and compute intensive. Combined with big data these require 
distributed storage and parallel, or distributed computing. At the same time many of 
the machine learning and data mining algorithms are not straightforward to 
parallelize. A recent survey (Paradigm 4 2014) found that “although 49 % of the 
respondent data scientists could not fit their data into relational databases anymore, 
only 48 % have used Hadoop or Spark—and of those 76 % said they could not work 
effectively due to platform issues”. 

This is an indicator that big data computing is too complex to use without 
sophisticated computer science know-how. One direction of advancement is for 
abstractions and high-level procedures to be developed that hide the complexities of 
distributed computing and machine learning from data scientists. The other direc- 
tion of course will be more skilled data scientists, who are literate in distributed 
computing, or distributed computing experts becoming more literate in data science 
and statistics. Advances are needed for the following technologies: 


• Abstraction is a common tool in computer science. Each technology is
cumbersome at first. Abstraction manages complexity so that the user (e.g.,
programmer, data scientist, or business user) can work closer to the level of
human problem solving, leaving out the practical details of realization. In the
evolution of big data technologies several abstractions have already simplified
the use of distributed file systems, either by adding SQL-like query languages
that make them resemble databases, or by adapting the style of processing to that of
familiar online analytical processing frameworks.

• Linked data is one state-of-the-art enabler for realizing an abstraction level over
large-scale data sources. The semantic linkage of data without prior knowledge 
and continuously linking with discovered knowledge is what will allow scalable 
knowledge modelling and retrieval in a big data setting. A further open question 
is how to manage a variety of data sources in a scalable way. Future research 
should establish a thorough understanding of data type agnostic architectures. 
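
The effect of a SQL-like abstraction can be illustrated by computing the same aggregate twice: once with explicit map, shuffle, and reduce steps, and once as a declarative query. This is only a sketch under stated assumptions: the readings are invented, and the single-node SQLite engine from the Python standard library stands in for a SQL layer over a distributed store.

```python
import sqlite3
from collections import defaultdict

# Hypothetical meter readings: (meter_id, kWh).
readings = [("m1", 1.5), ("m2", 0.75), ("m1", 2.0), ("m3", 1.25), ("m2", 0.25)]


def total_by_meter_manual(rows):
    """Low-level version: explicit map, shuffle (group), and reduce steps."""
    mapped = ((meter, kwh) for meter, kwh in rows)           # map
    groups = defaultdict(list)
    for meter, kwh in mapped:                                 # shuffle
        groups[meter].append(kwh)
    return {meter: sum(vals) for meter, vals in groups.items()}  # reduce


def total_by_meter_sql(rows):
    """Declarative version: the engine decides how to group and aggregate."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (meter TEXT, kwh REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    result = dict(conn.execute(
        "SELECT meter, SUM(kwh) FROM readings GROUP BY meter"))
    conn.close()
    return result


print(total_by_meter_manual(readings))
print(total_by_meter_sql(readings))
```

The query states what is wanted and leaves the execution strategy to the engine, which is the property that lets a data scientist or business user work without knowing how the data is partitioned.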


13.9 Conclusion and Recommendations for the Energy 
and Transport Sectors 


The energy and transport sectors, from an infrastructure perspective as well as from 
resource efficiency, global competitiveness, and quality of life perspectives, are 
very important for Europe. 

The analysis of the available data sources in energy as well as their use cases in 
the different categories of big data value, operational efficiency, customer experi- 
ence, and new business models helped in identifying the industrial needs and 
requirements for big data technologies. In the investigation of these requirements, 
it becomes clear that a mere utilization of existing big data technologies as employed 
by online data businesses will not be sufficient. Domain- and device-specific adapt- 
ations for use in cyber-physical energy and transport systems are necessary. Inno- 
vation regarding privacy and confidentiality preserving data management and 
analysis is a primary concern of the energy and transport sector stakeholders. Without 
satisfying the need for privacy and confidentiality there will always be regulatory 
uncertainty, and uncertainty regarding user acceptance of a new data-driven offering. 

Among the energy and transport sector stakeholders, there is a sense that “big 
data” will not be enough. The increasing intelligence embedded in infrastructures 
will be able to analyse data to deliver “smart data”. This seems to be necessary, 
since the analytics involved will require much more elaborate algorithms than for 
other sectors. In addition, the stakes in energy and transport big data scenarios are 
very high, since the optimization opportunities will affect critical infrastructures. 

There are a few examples in the energy and transport sectors where a technology
for data acquisition, i.e. a smart device, has been around for many years, or where
the stakeholders have already been measuring and capturing a substantial amount of
data. However, the business need was unclear, making it difficult to justify investment.
With recent advances it is now possible for the data to be communicated,
stored, and processed cost-effectively. Hence, some stakeholders run the danger of
not acknowledging the technology push. On the other hand, unclear regulation on
what usage is allowed with the data keeps them from experimenting. 

Many of the state-of-the-art big data technologies just await adaptation and 
usage in these traditional sectors. The technology roadmap identifies and elaborates 
the high-priority requirements and technologies that will take the energy and 
transport sectors beyond state of the art, such that they can concentrate on gener- 
ating value by adapting and applying those technologies within their specific 
application domains and value-adding use cases. 


Chapter 14
Big Data in the Media and Entertainment 
Sectors 


Helen Lippell 


14.1 Introduction 


The media and entertainment industries have frequently been at the forefront of 
adopting new technologies. The key business problems that are driving media 
companies to look at big data capabilities are the need to reduce the costs of operating 
in an increasingly competitive landscape and, at the same time, the need to generate 
revenue from delivering content and data through diverse platforms and products. 

It is no longer sufficient merely to publish a daily newspaper or broadcast a 
television programme. Contemporary operators must drive value from their assets 
at every stage of the data lifecycle. The most nimble media operators nowadays 
may not even create original content themselves. Two of the biggest international 
video streaming services, Netflix and Amazon, are largely aggregators of others’ 
content, though also offering originally commissioned content to entice new and 
existing subscribers. 

Media industry players are more connected with their customers and competitors 
than ever before. Thanks to the impact of disintermediation, content can be generated, 
shared, curated, and republished by literally anyone with an Internet-enabled device. 
Global revenues from such devices, including smartphones, tablets, desktop PCs, TVs, 
games consoles, e-readers, wearable gadgets, and even drones were expected to be 
around $750 billion in 2014 (Deloitte 2014). This means that the ability of big data 
technology to ingest, store, and process many different data sources, and in real-time, 
is a valuable asset to the companies who are prepared to invest in it. 

The media sector is in many respects an early adopter of big data technologies,
but much more evolution has to happen for the full potential to be realized. Better 
integration between solutions along the data value chain will be essential in order to 
convince decision-makers to invest in innovation, especially in times of economic 
uncertainty. Also, the solutions market is dominated by US and, increasingly,
Asian firms. Therefore, there is an economic imperative for Europe to both develop
and use big data technologies more extensively. Media and entertainment content 
and platforms have a global reach that many companies in other sectors, even retail 
and manufacturing, would be envious of. 

Case studies of successful big data projects in media have tended to come from 
the left-hand end of the data value chain (i.e. data acquisition and analysis). 
However, there is a need to identify both exemplars and gaps in the curation and 
usage of big data, as these are significant areas of competitive advantage for media 
organizations. Big data contributes to the bottom line by enabling organizations to 
pursue digital transformation. According to PWC (2014), this forges the trust of 
consumers, creates the confidence to innovate with speed and agility, and 
empowers innovation. 

Unlike some other sectors, the vast majority of actionable data in the media 
sector is already in digital form (and analogue products such as newspapers have 
been created through digital technologies for some years now). However, this does 
not mean that organizations are deriving the fullest possible financial benefit or cost 
efficiencies from both their existing data and new sources of data. There is a 
growing body of evidence that there is much work to do at research and policy 
levels to support the burgeoning ecosystem of diverse businesses engaged in 
analysing, enhancing, and delivering content and data. 


14.2 Analysis of Industrial Needs in the Media 
and Entertainment Sectors 


The media sector has always generated data, whether from research, sales, customer
databases, or log files. Equally, the vast majority of publishers and broadcasters
have always faced the need to compete, right from the earliest days of
newspapers in the eighteenth century. Even government or publicly funded media
bodies have to continually prove their value to their audiences, in order to stay
relevant in a world of extensive choice and to secure future funding. But the big
data mind-set, technical solutions, and strategies offer the ability to manage and 
disseminate data at speeds and scales that have never been seen before. 

There are three main areas where big data has the potential to disrupt the status 
quo and stimulate economic growth within the media and entertainment sectors: 


1. Products and Services: Big data-driven media businesses have the ability to 
publish content in more sophisticated ways. Human expertise in, e.g., curation, 
editorial nous, and psychology can be complemented with quantitative insights 
derived from analysing large and heterogeneous datasets. But this is predicated on 
big data analysis tools being easy to use for data scientists and business users alike. 

2. Customers and Suppliers: Ambitious media companies will use big data to find 
out more about their customers—their preferences, profile, attitudes—and they 
will use that information to build more engaged relationships. With the tools of
social media and data capture now widely available to more or less anyone,
individuals are also suppliers of content back to media companies. Many
organizations now bake social media analysis into their orthodox journalism
processes, so that consumers have a richer, more interactive relationship with
news stories. Without big data applications, there will be a wasteful and random 
approach to finding the most interesting content. 

3. Infrastructure and Process: While start-ups and SMEs can operate efficiently 
with open source and cloud infrastructure, for larger, older players, updating 
legacy IT infrastructure is a challenge. Legacy products and standards still need 
to be supported in the transition to big data ways of thinking and working. 
Process and organizational culture may also need to keep pace with the expec- 
tations of what big data offers. Failure to transform the culture and skillset of 
staff could impact companies who are profitable today but cannot adapt to data- 
driven business models. 


14.3 Potential Big Data Applications for the Media 
and Entertainment Sectors 


Six application scenarios for the media sector were described and further developed 
in Zillner et al. (2013, 2014a). All of these scenarios represent tangible business 
models for organizations; however, without support from big data technologies, 
companies will not be able to mature their existing pilots or small-scale projects 
into future revenue opportunities (Table 14.1). 
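
The cross-sell scenario in Table 14.1 relies on recommendation engines built on collaborative filtering. As a rough illustration of that technique, the sketch below scores unseen titles for a user by item-to-item cosine similarity. The users, titles, and ratings are invented, and the function names are hypothetical; content-based and hybrid approaches, which the table also mentions, are not shown.

```python
from math import sqrt

# Hypothetical viewing data: user -> {title: rating}. Illustrative only.
ratings = {
    "ana": {"Drama A": 5, "Drama B": 4, "Comedy C": 1},
    "ben": {"Drama A": 4, "Drama B": 5},
    "chloe": {"Comedy C": 5, "Comedy D": 4},
    "dev": {"Drama A": 5},
}


def item_vectors(user_ratings):
    """Invert user -> item ratings into item -> {user: rating} vectors."""
    vectors = {}
    for user, items in user_ratings.items():
        for item, score in items.items():
            vectors.setdefault(item, {})[user] = score
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    dot = sum(a[u] * b[u] for u in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recommend(user, user_ratings, top_n=1):
    """Rank titles the user has not seen by similarity to titles they rated."""
    vectors = item_vectors(user_ratings)
    seen = user_ratings[user]
    scores = {}
    for candidate, vec in vectors.items():
        if candidate in seen:
            continue
        scores[candidate] = sum(
            cosine(vec, vectors[item]) * rating for item, rating in seen.items()
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend("dev", ratings))
```

At media scale the item vectors no longer fit on one machine, which is where the distributed storage and processing requirements discussed in this chapter come in.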


14.4 Drivers and Constraints for Big Data in Media 
and Entertainment Sectors 


Like all businesses, media companies aim to maximize revenue, minimize costs, 
and improve decision-making and business processes. 


14.4.1 Drivers 


Specific to the media and entertainment sectors though are the following drivers: 


• Aim to understand customers on a very detailed level, often by analysing many
different types of interaction (e.g. product usage, customer service interactions, 
social media, etc.). 

Table 14.1 Summary of six application big data scenarios for the media sector

Name: Data journalism
Summary: Large volumes of data become available to a media organization.
Synopsis: Single or multiple datasets require analysis to derive insight, find interesting stories, and generate material. This can then be enhanced and ultimately monetized by selling to customers.
Business objectives:
- Improve quality of journalism and therefore enhance the brand
- Analyse data more thoroughly for less cost
- Enable data analysis to be performed by a wider range of users

Name: Dynamic semantic publishing
Summary: Scalable processing of content for efficient targeting
Synopsis: Using semantic technologies to both produce and target content more efficiently
Business objectives:
- Manage content and scarce staff resources more efficiently
- Add value to data to differentiate services from competitors

Name: Social media analysis
Summary: Processing of large user-generated content datasets.
Synopsis: Batch and real-time analysis of millions of tweets, images, status updates to identify trends and content that can be packaged in value-added services.
Business objectives:
- Create value-added services for clients
- Perform large-scale data processing in a cost-effective manner

Name: Cross-sell of related products
Summary: Developing recommendation engines using multiple data sources.
Synopsis: Applications that exploit collaborative filtering, content-based filtering, and hybrids of both approaches.
Business objectives:
- Generate more revenue from customers

Name: Product development
Summary: Using predictive analytics to commission new services
Synopsis: Data mining to support development of new and enhanced products for the marketplace
Business objectives:
- Offer innovative new products and services
- Enable development in a more quantitative way than is currently possible

Name: Audience insight
Summary: Using data from multiple sources to build up a comprehensive 360° view of a customer
Synopsis: Extension of scenario “Product Development”: mining of data external to the organization for information about customer habits and preferences
Business objectives:
- Reduce costs of customer retention and acquisition
- Use insights to aid commissioning of new products and services
- Maximize revenue from customers


• Operate in crowded sub-sectors such as digital marketing or book publishing,
where very few players have dominance, and consumer preferences and fashions
can change very rapidly.

• Diversify service offerings wherever possible. Most significant European
media companies operate in many areas, for example, newspaper publishing,
websites, and commercial apps; broadcasters may also sell broadband access.

• Communicate to build influence within society, e.g. politically. This is less
tangible than just selling products but seen as equally important by media
owners or governments.


14.4.2 Constraints 


The constraints for big data in the media and entertainment sectors can be 
summarized as follows: 


• Increased consumer awareness and concern about how personal data is being
used. There is regulatory uncertainty for European businesses that handle personal
data, which potentially puts them at a disadvantage compared to, say, US
companies who operate within a much more relaxed legal landscape.

• Insufficient access to finance for media start-ups and SMEs. While it is
relatively easy to start a new company producing apps, games, or social networks,
it is much harder to scale up without committed investors.

• The labour market across Europe is not providing enough data professionals
able to manipulate big data applications, e.g. for data journalism and product
management.

• Fear of piracy and consumer disregard for copyright may disincentivize
creative people and companies from taking risks to launch new media and
cultural products and services.

• Large US players dominate the content and data industry. Companies such as
Apple, Amazon, and Google between them have huge dominance in many
sub-sectors including music, advertising, publishing, and consumer media
electronics.

• Differences in penetration of high-speed broadband provision across member
countries, in cities, and in rural areas. This is a disincentive for companies
looking to deliver content that requires high bandwidth, e.g. streaming movies,
as it reduces the potential customer base.


14.5 Available Media and Entertainment Data Resources 


Table 14.2 is intended to give a flavour of the data sources that most media
companies routinely handle. The first part of the table lists some categories of data
that are generated by the companies themselves, while the second part shows
third-party sources that are or can be processed by those in the media sector,
depending on their particular line of business.

Each type of data source is matched to a key characteristic of big data. Customarily,
the technology industry has talked of “the three Vs of big data”, that is, volume,
variety, and velocity. Kobielus (2013) also discusses a fourth characteristic: veracity.
This is important for the media sector because consumer products and services
can quickly fail if the content lacks authoritativeness, is of poor quality, or has
uncertain provenance. According to IBM (2014), 27 % of respondents to a US
survey were unsure even how much of their data was inaccurate, suggesting that the
scale of the problem is underestimated.


Table 14.2 Media data resources mapped to “V” characteristics of big data

Internally generated data | Key “V” characteristic
Consumer profile details including customer service interactions. | Volume: Large amounts of data to be stored and potentially mined. Variety applies when considering the different ways customers may interact with a media service provider, and hence the opportunity for the business to “join up the dots” and better understand them.
Network logging (e.g. for web or entertainment companies operating their own networks). | Velocity: Network issues must be identified in real-time in order to resolve problems and retain consumer trust.
Organizations’ own data services to end users. | Characteristic(s) will depend on business objective of the data, e.g., a news agency will prioritize speed of delivery to customers, a broadcaster will be focused on streaming content in multiple formats to multiple types of device.
Consumer preferences inferred from sources including click stream data, product usage behaviour, purchase history, etc. | Volume: Large amounts of data can be gathered. Velocity will become pertinent where the service needs to be responsive to user action, e.g., online gaming networks which upsell extra features to players.

Third-party data | Key “V” characteristic
Commercial data feeds, e.g., sports data, press agency newswires. | Velocity: Being first to use data such as sports or news events builds competitive advantage.
Network information (where external networks are being used, e.g., messaging apps that piggyback on mobile networks). | Velocity: Network issues must be identified in real-time in order to ensure continuity of service.
Public sector open datasets. | Veracity: Open data may have quality, provenance, and completeness issues.
Free structured and/or linked data, e.g., Wikidata/DBpedia | Veracity: Crowdsourced data may have quality, provenance, and completeness issues.
Social media data, e.g., updates, videos, images, links, and signals such as “likes”. | Volume, variety, velocity, and veracity: Media companies must prioritize processing based on expected use cases. As one example, data journalism requires a large volume of data to be prepared for analysis and interpretation. On the other hand, a media marketing business might be more concerned with the variety of social data across many channels.


14.6 Media and Entertainment Sector Requirements


The Media and Entertainment Sectorial Forum identified
several requirements that need to be addressed by big data applications in the
domain. The requirements are divided into non-technical and technical
requirements.


14.6.1 Non-technical Requirements 


It is important to note that the widespread uptake of big data within the media 
industry is not solely dependent on successful implementation of specific technol- 
ogies and solutions. In Zillner et al. (2014b), a survey was undertaken among 
European middle and senior managers from the media sector (and also the telecoms 
sector, where large players are increasingly moving into areas that were once 
considered purely the realm of broadcasters, publishers, etc.). Respondents were 
asked to rank several big data priorities based on how important they would be to 
their own organizations. 

It is striking that all survey participants identified the need for a European 
framework for shared standards, a clear regulatory landscape, and a collaborative 
ecosystem—implying that businesses are suffering from a lack of confidence in 
their ability to see through the hype and really get to grips with big data in their 
enterprises. Another area ranked as very important by a notable proportion of 
respondents was making solutions usable and attractive for business users 
(i.e. not just data scientists). 


14.6.2 Technical Requirements 


Table 14.3 lists 37 requirements that were distilled from the work of the Media 
Sector Forum. Each requirement is matched to a business objective (although of 
course in practice some requirements could meet more than one objective). The five 
columns at the right-hand side of the table place each requirement in its appropriate 
place(s) along the big data value chain. Media, as a mostly customer-facing, 
revenue-generating economic sector, has many critical needs in data curation and 
usage. 

Table 14.3 Big data technical requirements of the media sector

Big data requirement | Business objective | Value chain stage (Acquisition / Analysis / Curation / Storage / Usage)
Curate heterogeneous data sources in a content and origin agnostic manner | Improve business processes | X
Programmatically interrogate data for trends | Improve business processes |
Quickly start processing new data types as they become needed | Improve business processes | X
Analyse unstructured data with regard to sentiment, topic, and other intangible aspects of text | Improve business processes |
Transform and augment open data from the public sector with regard to format, semantics, and quality | Improve business processes | X
Scalable tools for search and discovery applications | Improve business processes |
Visualize data for analytics and metrics (especially for business-technical users) | Improve business processes |
Automatically create and apply metadata to datasets | Improve business processes | X
Quickly and accurately process data in near real-time | Improve decision-making | X
Apply models and ontologies to data to extract relationships | Improve decision-making | X
Transform streams from sensors into actionable views | Improve decision-making | X
Analytics tools which enable powerful querying and manipulation by non-programmers or statisticians | Improve decision-making |
Inference engines to analyse semantic graph data | Improve decision-making | X X
Derive value from proprietary datasets | Increase revenue | X
Derive value from public open datasets | Increase revenue |
Deliver tailored data and content to customers | Increase revenue |
Human-centred editorializing of curated data streams | Increase revenue |
Algorithms to crunch data to produce more interesting recommendations than “more of the same” | Increase revenue |
Algorithm management tools for non-technical users | Increase revenue |
Enrich multimedia content such as images and videos with semantic metadata | Increase revenue |
Blend user-generated content with commercially produced media to create new digital products | Increase revenue | X
Generate insights from data to enable new business models (e.g. cross-selling based on viewing habits) | Increase revenue |
Increase conversions from offline marketing activities (e.g. direct mail) by analysing online data | Increase revenue |
Predictive analytics solutions that can identify trends, segments, and patterns without these explicitly being modelled | Increase revenue |
Return more relevant search results in consumer-facing applications using semantic analysis | Increase revenue |
Database solutions that can be set up more quickly than with traditional applications | Reduce costs |
Capability to use crowdsourced data curation to complement internal subject matter expertise | Reduce costs |
Manage large-scale data in graph databases | Reduce costs |
Translate unstructured data (e.g. text or voice) to one or many languages | Reduce costs | X
High-volume data scraping and crawling tools | Reduce costs | X
Identify patterns in data to drive insights about consumer behaviour | Understand customers |
Take account of many factors (e.g. location, device, user profile, usage context) to better target content delivery | Understand customers | X
Connect data from all customer interactions to form a 360° view | Understand customers | X
Ingest data from new classes of device (e.g. wearables) | Understand customers | X
Drill down into consumer behaviour in more granular detail | Understand customers |
Foster a more engaged relationship with audiences and customers through unstructured social data analysis | Understand customers |
Clear policy direction on use of personal data within the EU | Understand customers | X X




14.7 Technology Roadmap for Big Data in the Media 
and Entertainment Sectors 


Of all the sectors discussed in this book, media is arguably the one that changes 
most suddenly and most often. New paradigms can emerge extremely quickly and 
become commercially vital in a short space of time (e.g. Twitter was founded only 
in 2006 and now has a market capitalization of many billions of dollars). The year 
2015 onwards will see many media players and consumers alike experimenting 
with drones (more strictly, “unmanned aerial vehicles”, or UAVs) to see if captured 
footage can be monetized either directly as content or indirectly to attract 
advertising. 

Figure 14.2 and Table 14.4 consolidate the outcomes of the research completed 
in Zillner et al. (2013, 2014a), along with additional background research. Fig- 
ure 14.1 maps out the methodology used to derive the sector roadmap, showing how 
iterative engagement with industry supported at every stage the definition of the 
needs and technologies around big data for the media sector. 

Any roadmap must be cognisant of the risk that it will be out of date before it is 
even published. Nevertheless, the key headings shown in the figures in this section 
are strongly predicted to remain highly relevant to the sector for the following 
reasons. 


14.7.1 Semantic Data Enrichment 


Semantics is a long-established and now fast-developing field that is finally fulfill- 
ing its academic promise. Major media applications such as “intelligent personal 
assistants”, e.g. Siri and Cortana, are underpinned by “artificial intelligence” and 
semantic analysis technology. More development is needed to help commercial 
organizations in Europe exploit the potential of ontologies, graph databases, and 
curation platforms. 


14.7.2 Data Quality 


The key technological developments in this area include open data and data 
standards generally to aid interoperability. Also key are capabilities for processing 
unstructured (especially natural language) data streams. Finally, there is a need for 
back-end systems that can absorb different types of data with as little friction as 
possible, by minimizing the need to define data schemas upfront. 

Fig. 14.2 Mapping requirements to research questions in the media sector 


14.7.3 Data-Driven Innovation 


Three key technologies underpinning the drive for high-quality innovation are 
machine learning at enterprise scale; the Internet of Things (IoT), which will 
exponentially increase the volume and diversity of data streams available to anyone 
involved in media or data-driven storytelling; and finally, tools to better interpret 
customer interactions with products and services. 


14.7.4 Data Analysis 


Media and entertainment companies need to analyse data not only at the customer 
and product levels, but also at network and infrastructure levels (e.g. streaming 
video suppliers, Internet businesses, television broadcasters, and so on). Key 
technologies in the coming years will be descriptive analytics, more sophisticated 
customer relationship management solutions, and lastly data visualization solutions 
that are accessible to a wide range of users in the enterprise. It is only by “human- 
izing” these tools that big data will be able to deliver the benefits that data-driven 
businesses increasingly demand (Table 14.4). 


14.8 Conclusion and Recommendations for the Media 
and Entertainment Sectors 


Europe has much to offer in culture and content to the global market. European 
publishers and TV companies are globally renowned, but no EU-based competitor 
to the multinational giants Google, Amazon, Apple, and Facebook has emerged. 
Differences between the European and US economies, such as ease of access to 
venture capital, would seem to preclude this happening. Therefore, the best way 
forward for Europe is to build on its strengths of creativity and free movement of 
people and services, in order to bring together communities of industrial players, 
researchers, and government to tackle the following priorities: 


• Making sense of data streams, whether text, image, video, sensors, and so 
on. Sophisticated products and services can be developed by extracting value 
from heterogeneous sources. 

• Exploiting big data step changes in the ability to ingest and process raw data, so 
as to minimize risks in bringing new data-driven offerings to market. 

• Curating quality information out of vast data streams, using algorithmic scalable 
approaches and blending them with human knowledge through curation 
platforms. 

• Accelerating business adoption of big data. Consumer awareness is growing and 
technical improvements continue to reduce the cost of storage and analytics 
tools among other things. Therefore, it is more important than ever that 
businesses have confidence that they understand what they want from big data and 
that the non-technical aspects such as human resources and regulation are in 
place. 


Part IV 
A Roadmap for Big Data Research 


Chapter 15 
Cross-sectorial Requirements Analysis 
for Big Data Research 


Tilman Becker, Edward Curry, Anja Jentzsch, and Walter Palmetshofer 


15.1 Introduction 


This chapter identifies the cross-sectorial requirements for big data research nec- 
essary to define a prioritized research roadmap based on expected impact. The aim 
of the roadmaps is to maximize and sustain the impact of big data technologies and 
applications in different industrial sectors by identifying and driving opportunities 
in Europe. The target audiences for the roadmaps are the different stakeholders 
involved in the big data ecosystem including industrial users of big data applica- 
tions, technical providers of big data solutions, regulators, policy makers, 
researchers, and end users. 

The first step toward the roadmap was to establish a list of cross-sectorial 
business requirements and goals from each of the industrial sectors covered in 
part of this book and in Zillner et al. (2014). The consolidated results comprise a 
prioritized set of cross-sector requirements that were used to define the technology, 
business, policy, and society roadmaps with action recommendations. This chapter 
presents a condensed version of the cross-sectorial consolidated requirements. It 
discusses each of the high-level and sub-level requirements together with the 
associated challenges that need to be tackled. Finally the chapter concludes with 
a prioritization of the cross-sectorial requirements. As far as possible, the roadmaps 
have been quantified to allow for a well-founded prioritization and action plans 
(e.g. policies). 


15.2 Cross-sectorial Consolidated Requirements 


In order to establish a common understanding of requirements as well as technology 
descriptions across domains, the sector-specific requirement labels were aligned. 
Each sector provided their requirements with the associated user needs, and similar 
and related requirements were merged, aligned, or restructured to create a 
homogeneous set. 

While most of the requirements exist within each of the sectors, the level of 
importance for the requirement in each sector varies. For the cross-sector analysis, 
any requirements that were identified by at least two sectors as being a significant 
requirement for that sector were included into the cross-sector roadmap definition. 
Thus, the initial list of 13 high-level requirements and 28 sub-level requirements 
was reduced to 5 high-level requirements and 12 sub-level requirements (see 
Table 15.1). Within this chapter, the discussion on each cross-sectorial requirement 
has been condensed and minor updates applied. Full details are available in Becker 
et al. (2014). 
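
The inclusion rule described above (keep a requirement only if at least two sectors flag it as significant) can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by the authors: the sector/requirement pairs, the function name `cross_sector_requirements`, and the `min_sectors` parameter are all assumptions introduced for the example.

```python
from collections import defaultdict

# Hypothetical input: (sector, requirement) pairs marking a requirement as
# significant for a sector. The entries are illustrative, not the book's data.
SIGNIFICANT = [
    ("health", "data enrichment"),
    ("telecom & media", "data enrichment"),
    ("retail", "data improvement"),
    ("telecom & media", "data improvement"),
    ("manufacturing", "multimodal interfaces"),
]


def cross_sector_requirements(pairs, min_sectors=2):
    """Return requirements flagged as significant by at least min_sectors sectors."""
    sectors_by_requirement = defaultdict(set)
    for sector, requirement in pairs:
        sectors_by_requirement[requirement].add(sector)
    return {
        requirement: sorted(sectors)
        for requirement, sectors in sectors_by_requirement.items()
        if len(sectors) >= min_sectors
    }


selected = cross_sector_requirements(SIGNIFICANT)
for requirement, sectors in sorted(selected.items()):
    print(f"{requirement}: {len(sectors)} sectors ({', '.join(sectors)})")
```

In this toy input, the requirement named by a single sector is dropped, while the two requirements shared by two sectors are retained.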


15.2.1 Data Management Engineering 


The high-level requirement data management engineering aims at efficient strate- 
gies to manage heterogeneous data sources and technologies. Data management 
engineering has four sub-requirements: 


• Data enrichment 

• Data integration 

• Data sharing 

• Real-time data transmission 


15.2.1.1 Data Enrichment 


The sub-requirement data enrichment aims to make unstructured data understandable 
across domains, applications, and value chains. 

In the health sector, data enrichment is of high relevance, since 90 % of health 
data is only available in unstructured formats without semantic labels informing 
applications on the content of the data. In particular, approaches for the semantic 
annotation of medical images and medical text are needed. 

Table 15.1 Consolidated cross-sectorial requirements (and demanding sectors) 

Columns: Technological requirement | Number of demanding sectors | Health | Public | Finance & Insurance | Energy & Transport | Manufacturing | Retail | Telecom & Media 

Data Management Engineering 
    Data Enrichment 
    Data Integration 
    Data Sharing 
    Real-Time Data Transmission 
Data Quality 
    Data Improvement 
Data Security and Privacy 
Data Visualization and User Experience 
Deep Data Analytics 
    Modelling and Simulation 
    Natural Language Analytics 
    Pattern Discovery 
    Predictive Analytics 
    Prescriptive Analytics 
    Real-Time Insights 
    Usage Analytics 

In the telecom and media sector, data enrichment includes ontologies 
(e.g. eTOM SID), data transformation, addition of metadata, formats, etc., taking 
into account that the data sources are heterogeneous (including social media 
information, audio, customer data, and traffic data, for example). Data coming 
from different sources and in different formats, produced by heterogeneous sys- 
tems, have to be processed together. In order to address these requirements, the 
following challenges need to be tackled: 


• Information extraction from text 
• Image understanding algorithms 
• Standardized annotation framework 


15.2.1.2 Data Sharing and Integration 


The sub-requirement data sharing and integration aims to establish a basis for the 
seamless integration of multiple and diverse data sources into a big data platform. 
The lack of standardized data schemas and semantic data models, as well as the 
fragmentation of data ownership, are important aspects that need to be tackled. 

As of today, less than 30 % of health data is shared between healthcare providers 
(Accenture 2012). In order to enable seamless data sharing in the health and other 
domains, a standardized coding system and terminologies as well as data models 
are needed. 

In the telecom sector, data has been collected for years and classified according 
to business standards based on eTOM (2014), but the data reference model does not 
yet contemplate the inclusion of social media data. A unified information system is 
required that includes data from both the telecom operator and the customer. Once 
this information model is available, it should be incorporated in the eTOM SID 
reference model and taken into account in big data telecom-specific solutions for all 
data (social and non-social) to be integrated. 

In the retail sector, standardized product ontologies are needed to enable sharing 
of data between product manufacturers and retailers. Services to optimize opera- 
tional decisions in retail are only possible with semantically annotated product data. 

In the public sector, data sharing and integration are important to overcome the 
lack of standardization of data schemas and fragmentation of data ownership, to 
achieve the integration of multiple and diverse data sources into a big data platform. 
This is required in cases where data analysis has to be performed from data 
belonging to different domains and owners (e.g. different agencies in the public 
sector) or integrating heterogeneous external data (from open data, social networks, 
sensors, etc.). 

In the financial sector, several factors have put organizations in a situation where 
a large number of different datasets lack interconnection and integration. Financial 
organizations recognize the potential value of interlinking such datasets to extract 
information that would be of value either to optimize operations, improve services 
to customers, or even create new business models. Existing technology can cover 
most of the requirements of the financial services industry, but the technology is 
still not widely implemented. 

In order to address these requirements, the following challenges need to be 
tackled: 


• Semantic data and knowledge models 

• Context information 

• Entity matching 

• Scalable triple stores, key/value stores 

• Facilitate core integration at data acquisition 

• Best practice for sharing high-velocity and high-variety data 

• Usability of semantic systems 

• Metadata and data provenance frameworks 

• Scalable automatic data/schema mapping mechanisms 


15.2.1.3 Real-Time Data Transmission 


The sub-requirement real-time data transmission aims at acquiring (sensor and 
event) information in real time. 

In the public sector, this is closely related to the increasing capability of 
deploying sensors and Internet of Things scenarios, such as public safety and smart 
cities. Image sensors have followed Moore’s Law, doubling megapixel density per 
dollar every 2 years (PWC 2014). Distributed processing and cleaning capabilities 
are required for image sensors in order to avoid overloading the transmission 
channels (Jobling 2013) and provide the required real-time analysis to feed situa- 
tional awareness systems for decision-makers. 

In the manufacturing sector, sensor data must be acquired at high sample rates 
and needs to be transmitted close to real time in order to be used effectively. 
Decisions can be made at central planning, command, and control points, or can 
be made at a local level in a distributed fashion. Data transmission must be 
sufficiently close to real time, greatly improving on the currently long intervals 
(hourly or greater) in which inventory data is sampled. The hostile working 
environment in manufacturing may hamper data transmission. 

For the retail sector, it is important that the data from sensors inside the store are 
acquired in real time. This includes visual data from cameras and customer loca- 
tions from positioning sensors. 

In order to address these requirements, the following challenges need to be 
tackled: 


• Distributed data processing and cleaning 
• Read/write optimized storage solutions for high-velocity data 
• Near real-time processing of data streams 


15.2.2 Data Quality 


The high-level requirement, data quality, describes the need to capture and store 
high-quality data so that analytic applications can use the data as reliable input to 
produce valuable insights. Data quality has one sub-requirement: 


• Data improvement 


Big data applications in the health sector need to fulfil high data quality 
standards in order to derive reliable insights for health-related decisions. For 
instance, the features and parameter list used for describing patient health status 
needs to be standardized in order to enable the reliable comparison of patient 
(population) datasets. 

In the telecom and media sectors, although data has been collected for years, 
there are still data quality issues that make the information unexploitable 
without pre-processing. 

In the financial sector, data quality is not a major issue in internally generated 
datasets, but information collected from external sources may not be fully reliable. 

In order to address these requirements, the following challenges need to be 
tackled: 


• Provenance management 
• Human data interaction 
• Unstructured data integration 


15.2.2.1 Data Improvement 


The sub-requirement data improvement aims at removing noise/redundant data, 
checking for trustworthiness, and adding missing data. 

In the telecom and media sectors, this relates to the ability to improve the 
commercial offering of the service provider based on the available information in 
traditional systems, as well as advanced techniques such as predictive, speech, or 
prescriptive analytics. 

In the retail sector, both sensor data and data extracted from web sources 
(i.e. product data and customer data) are error prone and need to be checked for 
trustworthiness. Therefore data improvement procedures are required that help to 
remove incorrect/redundant data and noise. 


In order to address these requirements, the following challenges need to be 
tackled: 


• Human validation via curation 
• Automatic removal of large amounts of noise at scale 
• Scalable semantic validation 
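
The three steps named for data improvement (removing redundant data, removing noise, and adding missing data) can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the sensor readings, the plausible temperature range, and the choice of median imputation are all hypothetical and not prescribed by the text.

```python
from statistics import median

# Hypothetical in-store sensor readings: (sensor_id, timestamp, temperature_c).
readings = [
    ("s1", 1, 21.0),
    ("s1", 1, 21.0),   # exact duplicate (redundant data)
    ("s1", 2, None),   # missing value
    ("s1", 3, 250.0),  # implausible value (noise)
    ("s1", 4, 22.0),
]


def improve(rows, low=-30.0, high=60.0):
    """Drop duplicates, discard out-of-range values, fill gaps with the median."""
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    # Treat values outside the assumed plausible range as missing.
    cleaned = [
        (sid, ts, v if v is not None and low <= v <= high else None)
        for sid, ts, v in unique
    ]
    known = [v for _, _, v in cleaned if v is not None]
    fill = median(known) if known else None
    return [(sid, ts, v if v is not None else fill) for sid, ts, v in cleaned]


improved = improve(readings)
print(improved)
```

Median imputation is chosen here only because it is simple and robust to outliers; a production pipeline would typically combine such automatic rules with the human validation listed above.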


15.2.3 Data Security and Privacy 


The high-level requirement data security and privacy describes the need to protect 
highly sensitive business and personal data from unauthorized access. Thus, it 
addresses the availability of legal procedures and the technical means that allow 
the secure sharing of data. 

In healthcare applications, a strong emphasis has to be put on data privacy and 
security since some of the usual privacy protection approaches could be bypassed 
by the nature of big data. For instance, in terms of health-related data, 
anonymization is a well-established approach to de-identify personal data. Never- 
theless, the anonymized data could be re-identified (El Emam et al. 2014) when 
aggregating big data from different data sources. 
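
The re-identification risk can be made concrete with a small linkage sketch in Python. It assumes a hypothetical "anonymized" health extract and a hypothetical public register that share quasi-identifiers (postcode, birth year, sex); every record and name is invented, and the function `reidentify` is illustrative rather than a method from the cited work.

```python
# Hypothetical records; every name and value is invented for illustration.
anonymized_health = [
    {"postcode": "1010", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "1010", "birth_year": 1980, "sex": "M", "diagnosis": "diabetes"},
    {"postcode": "1020", "birth_year": 1980, "sex": "M", "diagnosis": "hypertension"},
]
public_register = [
    {"name": "Person A", "postcode": "1010", "birth_year": 1975, "sex": "F"},
    {"name": "Person B", "postcode": "1020", "birth_year": 1980, "sex": "M"},
    {"name": "Person C", "postcode": "1020", "birth_year": 1980, "sex": "M"},
]
QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")


def key(record):
    """Project a record onto its quasi-identifier values."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(health, register):
    """Return (name, diagnosis) pairs whose quasi-identifiers match exactly one person."""
    people_by_key = {}
    for person in register:
        people_by_key.setdefault(key(person), []).append(person["name"])
    matches = []
    for record in health:
        names = people_by_key.get(key(record), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], record["diagnosis"]))
    return matches


reidentified = reidentify(anonymized_health, public_register)
print(reidentified)
```

Only the record whose quasi-identifier combination is unique in the register is re-identified; the record that matches two people stays ambiguous, which is the intuition behind grouping-based protections such as k-anonymity.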

Big data applications in retail require the storage of personal information of 
customers in order for the retailer to be able to provide tailored services. It is very 
important that this data is stored securely to ensure the protection of customer 
privacy. 

In the manufacturing sector, there are conflicting interests in storing data on 
products for easy retrieval and protection of data from unauthorized retrieval. Data 
collected during production and use may well contain proprietary information 
concerning internal business processes. Intellectual property needs to be protected 
as far as it is encoded in product and production data. Regulations for data 
ownership need to be established, for example, to define what access the 
manufacturer of a production machine may have to its usage data. 

Privacy protection for workers interacting in an Industry 4.0 environment needs 
to be established. Data encryption and access control need to be integrated into 
object memories. European and worldwide regulations need to be harmonized. There is 
a need for data privacy regulations and transparent privacy protection. 

In the telecom and media sector, one of the main concerns is that big data 
policies apply to personal data, i.e., to data relating to an identified or identifiable 
person. However, it is not clear whether the core privacy principles of the regulation 
apply to newly discovered knowledge or information derived from personal data, 
especially when the data has been anonymized or generalized by being transformed 
into group profiles. Privacy is a major concern which can compromise the end 
users’ trust, which is essential for big data to be exploited by service providers. An 
Ovum (2013) Consumer Insights Survey revealed that 68 % of Internet users across 
11 countries around the world would select a “Do-Not-Track” feature if it was 
easily available. This highlights a degree of end-user antipathy 
towards online tracking. Privacy and trust are an important barrier, since data must 
be rich in order for businesses to use it. 

Finding solutions to ensure data security and privacy may unlock the massive 
potential of big data in the public sector. Advances in the protection and privacy of 
data are key for the public sector, as it may allow the analysis of huge amounts of 
data owned by the public sector without disclosing sensitive information. In many 
cases, public sector regulations restrict the use of data for purposes other than 
those for which it was collected. Privacy and security issues are also preventing the use of 
cloud infrastructures (e.g. processing, storage) by many public agencies that deal 
with sensitive data. A new approach to security in cloud infrastructure may elim- 
inate this barrier. 

Data security and privacy requirements appear in the financial sector in the 
context of building new business models based on data collected by financial 
services institutions from their customers (individuals). Innovative services could 
be created with technologies that reconcile the use of data and privacy requirements. 
In order to address these requirements, the following challenges need to be tackled: 


• Hash algorithms 

• Secure data exchange 

• De-identification and anonymization algorithms 

• Data storage technologies for encrypted storage and databases; proxy re-encryption 
between domains; automatic privacy protection 

• Advances in “privacy by design” to link analytics needs with protective controls 
in processing and storage 

• Data provenance to enable usage transparency and metadata for privacy 
information 
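
As one illustration of how hash algorithms can support de-identification, the following Python sketch replaces a direct identifier with a keyed hash (HMAC-SHA256) from the standard library. The function name `pseudonymize`, the identifiers, and the key are assumptions made for the example. Note that this is pseudonymization rather than anonymization: whoever holds the key can recompute the mapping, and the remaining attributes may still allow re-identification.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Map an identifier to a stable pseudonym using HMAC-SHA256."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder key for the example; a real key would be generated randomly
# and stored separately from the data.
key_a = b"example-key-a"
key_b = b"example-key-b"

first = pseudonymize("customer-42", key_a)
again = pseudonymize("customer-42", key_a)
other = pseudonymize("customer-43", key_a)
other_domain = pseudonymize("customer-42", key_b)

print(first == again)         # same identifier and key give the same pseudonym
print(first != other)         # different identifiers give different pseudonyms
print(first != other_domain)  # a different key yields unlinkable pseudonyms
```

Using a separate key per data domain keeps records linkable within a domain while making it harder to join datasets across domains without authorization.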


15.2.4 Data Visualization and User Experience 


The high-level requirement data visualization and user experience describes the 
need to adapt the visualization to the user. This is possible by reducing the 
complexity of data, data inter-relations, and the results of data analysis. 

In retail it will be very important to adapt the information visualization to the 
specific customer. An example of this would be tailored advertisements, which fit 
the profile of the customer. 

In manufacturing, human decision-making and guidance need to be supported on 
all levels: from the production floor to high-level management. Appropriate data 
visualization tools must be available and integrated to support browsing, control- 
ling, and decision-making in the planning and execution process. This applies 
primarily to general big data but extends to and includes special visualization of 
spatiotemporal aspects of the manufacturing process for spatial and temporal 
analytics. 

In order to address these requirements, the following challenges need to be 
tackled: 


• Apply user modelling techniques to visual analytics 

• High performance visualizations 

• Large-scale visualization based on adaptive semantic frameworks 

• Multimodal interfaces in hostile working environments 

• Natural language processing for highly variable contexts 

• Interactive visualization and visual queries 


15.2.5 Deep Data Analytics 


The high-level requirement deep data analytics is the application of sophisticated 
data processing techniques to yield information from multiple, typically large 
datasets comprised of both unstructured and semi-structured data. Deep data anal- 
ysis has seven sub-requirements: 


• Modelling and simulation covers domain-specific tools for modelling and 
simulation of events according to changes from past events. 

• Natural language analytics aims at extracting information from unstructured 
sources (e.g. social media) to enable further analysis (for instance sentiment 
mining). 

• Pattern discovery aims at identifying patterns and similarities. 

• Real-time insights enable the analysis of real-time data for instant 
decision-making. 

• Usage analytics provide analysis of the usage of product, service, resources, 
process, etc. 

• Predictive analytics utilize a variety of statistical, modelling, data mining, and 
machine learning techniques to study recent and historical data to make 
predictions about the future. 

• Prescriptive analytics focus on finding the best course of action for a given 
situation. 


Prescriptive analytics belongs to a portfolio of analytic capabilities that include 
descriptive and predictive analytics. While descriptive analytics aims to provide 
insight into what has happened, and predictive analytics helps model and forecast 
what might happen, prescriptive analytics seeks to determine the best solution or 
outcome among various choices, given the known parameters. 
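
The distinction between the three kinds of analytics can be shown on a toy example in Python. Everything here is assumed for illustration: the weekly sales figures, the linear trend as the predictive model, and the overstock and stockout costs used to choose an order quantity.

```python
from statistics import mean

weekly_sales = [100, 110, 120, 130]  # hypothetical units sold per week

# Descriptive: what has happened?
average_sales = mean(weekly_sales)

# Predictive: what might happen? Fit a least-squares line and extrapolate one week.
n = len(weekly_sales)
xs = list(range(n))
x_bar, y_bar = mean(xs), mean(weekly_sales)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, weekly_sales)) / sum(
    (x - x_bar) ** 2 for x in xs
)
forecast = y_bar + slope * (n - x_bar)


# Prescriptive: which action is best, given the forecast and assumed costs?
def cost(order, demand, overstock_cost=1.0, stockout_cost=3.0):
    """Cost of ordering `order` units when `demand` units are requested."""
    return overstock_cost * max(order - demand, 0) + stockout_cost * max(demand - order, 0)


candidates = [120, 130, 140, 150]
best_order = min(candidates, key=lambda order: cost(order, forecast))

print(average_sales, forecast, best_order)
```

The prescriptive step depends on both the forecast and the cost assumptions: with a higher stockout cost or an uncertain forecast, a larger order could become the preferred action.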

In the public sector, deep data analytics can help in several scenarios where 
information should be extracted from data. In the scenario of monitoring and 
supervision of online gambling operators, the challenge is to detect specific crim- 
inal or illegal behaviours using pattern discovery to deliver real-time insights. 
Similar insights are needed in the supervision of markets regulated by the public 
sector (energy, telecommunications, stock markets, etc.). 

Other application scenarios also need deep data analytics, as in the case of public 
safety in smart cities, where real-time insights can enable the analysis of fresh/real- 
time data for instant decision-making. In these scenarios, situational awareness 
systems can be built using real-time data provided by networks of sensors and near 
real-time data captured from social networks through natural language analytics. 
Smart cities situation awareness can also apply modelling and simulation tools for 
managing events (e.g. managing large crowds of people in public events) to 
anticipate the results from decisions taken to influence the current conditions in 
real-time. 

Other application scenarios like predictive policing may require the use of 
predictive analytics to provide insights based on the learning from previous situa- 
tions. This would allow for optimal security resources allocation, according to the 
prediction of incidents, which may be based on temporal patterns or related to 
specific events of any kind (sport events, weather conditions, or any other variable). 

For the telecom and media sectors, deep data analytics are required in order to 
improve customer experience, either by tailoring the offerings, by improving 
customer care, or by proactively adapting resources (e.g. network) to meet the 
customer expectations in terms of service delivery. This can be achieved by 
obtaining a 360° customer view, which allows a better understanding of the 
customer and predicts their needs or demands. Advanced and flexible customer 
segmentation, knowing customer likes and dislikes, deeply analysing user habits, 
customer interactions, etc., help communication and content service providers to 
find patterns and sentiment out of the data, allowing cross selling based on multiple 
factors. Since Quality of Experience (QoE) and customer satisfaction can change 
very quickly (as mood does), analytics should ideally provide the means to calcu- 
late and automate the best next action in real time. 

Historical and online analytical processing of big data will be adopted as the 
insights gained will make planning and operations more precise. Real-time analyt- 
ics on the other hand still faces some technological challenges, which may well be 
the reason for the lack of adoption of real-time analytics in energy and transporta- 
tion. Manual steps in typical data analytics processes, such as data wrangling, for 
example, do not scale for the speed and volume of data to be analysed in operational 
efficiency scenarios in energy and transportation optimization. 

In the retail sector, operational decisions can be optimized by analysing unstruc- 
tured data from the web. This can be information about upcoming regional events, 
weather data, or even potential natural disasters that can be extracted from social 
networks using natural language analytics. Data, like visual data from cameras, 
acquired from sensors inside the store needs to be analysed to extract specific 
patterns, such as patterns of customer movement. Customer segmentation is 
possible by analysing customer-product and customer-staff interactions. This 
information can also be used to run prescriptive analytics. These are required to allow 
intelligent inventory, intelligent staff scheduling, and floor plan/product location 
optimization. 

In order to address these requirements, the following challenges need to be 
tackled: 


• Data integration, linking, and semantics 

• Sentiment analysis 

• Machine learning 

• Integrating semantics into large-scale modelling and simulation environments 

• Increasing scalability and robustness of information extraction, named entity 
recognition, machine learning, linked data, entity linking, and co-reference 
resolution 

• Validation of pattern analytics outputs and natural language analytics outputs 
with humans via curation 

• Integration of natural language analytics into data usage scenarios 

• Semantic pattern technologies including stream pattern matching and scalable 
complex pattern matching 

• Analytical databases to efficiently support predictive analytics 

• Combining large-scale reasoning with statistical approaches 

• Predictive maintenance: predict failures, determine maintenance intervals, and 
support failure analysis 

• Extend predictive analytics to prescriptive analytics 

• Complex event processing that applies business rules (or other frameworks) 
continuously to defined (short) intervals of a real-time data stream with low latency 

• In-memory technology, new visualization and interaction techniques, and automatic 
system reactions to enable ad hoc queries on large datasets to be executed with 
minimal latencies 

• Real-time and in-stream analytical processing 
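
The complex event processing challenge listed above (a rule evaluated continuously over a short interval of a data stream) can be sketched with a sliding window in Python. This is a simplified, single-process illustration: the function `detect_bursts`, the window length, and the threshold rule are assumptions, and production systems would use a dedicated stream processing engine.

```python
from collections import deque


def detect_bursts(events, window_seconds=10, threshold=3):
    """Yield each timestamp at which at least `threshold` events lie in the window.

    `events` is an iterable of event timestamps (seconds) in ascending order.
    """
    window = deque()
    for timestamp in events:
        window.append(timestamp)
        # Evict events that have fallen out of the sliding window.
        while window and timestamp - window[0] > window_seconds:
            window.popleft()
        # The "business rule": too many events within a short interval.
        if len(window) >= threshold:
            yield timestamp


alerts = list(detect_bursts([0, 20, 22, 25, 50, 100]))
print(alerts)
```

Because the generator processes one event at a time and keeps only the events inside the window, memory use stays bounded by the event rate rather than the length of the stream.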


15.3 Prioritization of Cross-sectorial Requirements 


An actionable roadmap should have clear selection criteria regarding the priority of 
all actions. In contrast to a technology roadmap for the context of a single company, 
a European technology roadmap needs to cover developments across different 
sectors. The process of defining the roadmap included an analysis of the big data 
market and feedback received from stakeholders. Through this analysis, a sense of 
what characteristics indicate higher or lower potential of big data technical require- 
ments was reached. 

As the basis for the ranking, a table-based approach was used that evaluated each 
candidate according to a number of applicable parameters. In each case, the 
parameters were collected with the goal of being sector independent. Quantitative 
parameters were used where possible and available. 

In consultation with stakeholders, the following parameters were used to rank
the various technical requirements:


• Number of affected sectors

• Size of affected sector(s) in terms of % of GDP

• Estimated growth rate of the sector(s)

• Forecast growth rate of the sector(s) attributable to big data technologies

• Estimated export potential of the sector(s)

• Estimated cross-sectorial benefits

• Short-term low-hanging fruit
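
A table-based, multi-parameter ranking of this kind can be sketched as a weighted average. The following is a minimal illustration only: the weights, the parameter keys, and the 0-100 rating scale are hypothetical assumptions, since the chapter names the parameters but does not state how they were weighted or combined.

```python
# Hypothetical weights in percentage points (they sum to 100). The source names
# the ranking parameters but does not publish how they were weighted.
WEIGHTS = {
    "affected_sectors": 25,
    "sector_size_pct_gdp": 20,
    "sector_growth": 15,
    "growth_from_big_data": 15,
    "export_potential": 10,
    "cross_sector_benefit": 10,
    "low_hanging_fruit": 5,
}


def score(ratings):
    """Return the weighted average of per-parameter ratings (each assumed 0-100)."""
    missing = sorted(set(WEIGHTS) - set(ratings))
    if missing:
        raise ValueError("missing ratings for: " + ", ".join(missing))
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS) / 100


def rank(candidates):
    """Sort candidate requirements by descending score."""
    scored = [(name, score(ratings)) for name, ratings in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `rank` with one rating dictionary per candidate requirement returns the candidates ordered from highest to lowest score. Keeping the weights in a single table makes it easy to test how sensitive the ordering is to a different weighting.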


Using these insights, a prioritization composed of multiple parameters was
created, which gives a relative sense of which technological requirements might
be poised for greater gains and which would face the lowest barriers. The ranking of
cross-sectorial technical requirements is presented in Table 15.2 and is illustrated in
Fig. 15.1, where colour indicates the level of estimated importance and the size of
the bubble reflects the estimated affected sectors of the industries. It is important
to note that these indices do not offer a full picture, but they do offer a reasonable
sense of both potential availability and capture across sectors. There are certain
limitations to this approach. Not all relevant numbers and inputs were available, as
the speed of technology development and adoption relies on several factors. The
ranking relies on forecasts and estimates from third parties and the project team. As
a consequence, it is not always possible to determine precise numbers for timelines
and specific impacts. Further investigation into these questions would be desirable
for future research. Full details of the ranking process are available in Becker, T.,
Jentzsch, A., & Palmetshofer, W. (2014).


Table 15.2 Prioritization of technical cross-sectorial requirements

Prioritization | Technological requirements | Score
Level 1: Urgent | Data security and privacy | 78
Level 1: Urgent | Data management engineering: data integration | 69.25
Level 1: Urgent | Deep data analytics: real-time insights | 61.5
Level 1: Urgent | Data management engineering: data sharing | 48.5
Level 2: Very important | Data quality | 40.5
Level 2: Very important | Data management engineering: real-time data transmission | 37
Level 2: Very important | Deep data analytics: modelling simulation | 37
Level 2: Very important | Deep data analytics: natural language analytics | 37
Level 2: Very important | Deep data analytics: pattern discovery | 34.25
Level 2: Very important | Deep data analytics | 31.75
Level 2: Very important | Data management engineering | 31.5
Level 3: Important | Data management engineering: data enrichment | 29.5
Level 3: Important | Data visualization and user experience | 29.5
Level 3: Important | Deep data analytics: prescriptive analytics | 29.5
Level 3: Important | Deep data analytics: usage analytics | 26.75
Level 3: Important | Data quality: data improvement | 24
Level 3: Important | Deep data analytics: predictive analytics | 20.75
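
The grouping of scores into the three priority levels of Table 15.2 can be reproduced with a simple banding function. This is a sketch under a stated assumption: the scores are taken from the table, but the cut-off values (45 and 30) are hypothetical, chosen only because they fall between the published level boundaries. The chapter does not state the thresholds that were actually used.

```python
# Scores as listed in Table 15.2.
SCORES = {
    "Data security and privacy": 78.0,
    "Data management engineering: data integration": 69.25,
    "Deep data analytics: real-time insights": 61.5,
    "Data management engineering: data sharing": 48.5,
    "Data quality": 40.5,
    "Data management engineering: real-time data transmission": 37.0,
    "Deep data analytics: modelling simulation": 37.0,
    "Deep data analytics: natural language analytics": 37.0,
    "Deep data analytics: pattern discovery": 34.25,
    "Deep data analytics": 31.75,
    "Data management engineering": 31.5,
    "Data management engineering: data enrichment": 29.5,
    "Data visualization and user experience": 29.5,
    "Deep data analytics: prescriptive analytics": 29.5,
    "Deep data analytics: usage analytics": 26.75,
    "Data quality: data improvement": 24.0,
    "Deep data analytics: predictive analytics": 20.75,
}

# Assumed cut-offs, not taken from the source; they merely separate the levels.
LEVEL_CUTOFFS = [(45.0, "Level 1: Urgent"), (30.0, "Level 2: Very important")]
LOWEST_LEVEL = "Level 3: Important"


def level(value):
    """Map a score to its priority level using the assumed cut-offs."""
    for cutoff, label in LEVEL_CUTOFFS:
        if value >= cutoff:
            return label
    return LOWEST_LEVEL


def group_by_level(scores):
    """Group requirement names by level, highest score first within each level."""
    groups = {}
    for name, value in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        groups.setdefault(level(value), []).append(name)
    return groups
```

With these assumed cut-offs, `group_by_level(SCORES)` yields the same three groups as the table.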


15.4 Summary 


The aim of the cross-sectorial roadmap is to maximize and sustain the impact of big
data technologies and applications in the different industrial sectors by identifying
and driving opportunities in Europe. While most of the requirements identified exist
in some form within each sector, their level of importance varies between sectors.
Any requirement that was identified by at least two sectors as significant was
included in the cross-sector roadmap definition. This led to the identification of
5 high-level requirements and 12 sub-level requirements with associated challenges
that need to be tackled.


Fig. 15.1 Cross-sectorial requirements prioritized (bubble chart titled "Scoring of
technical cross-sectorial requirements"; legend: Level 1: Urgent requirements,
Level 2: Very important requirements, Level 3: Important requirements)

Each cross-sectorial requirement was prioritized based on its expected impact.
The consolidated results comprise a prioritized set of cross-sector requirements that 
were used to define the cross-sectorial roadmaps with associated action 
recommendations. 
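
The inclusion rule described in this summary (a requirement enters the cross-sector roadmap when at least two sectors flag it as significant) can be sketched as follows. The sector names and requirement sets in `EXAMPLE` are hypothetical illustrations, not data from the project.

```python
from collections import Counter


def cross_sector_requirements(by_sector, min_sectors=2):
    """Return requirements flagged as significant by at least `min_sectors` sectors."""
    counts = Counter(req for reqs in by_sector.values() for req in set(reqs))
    return sorted(req for req, n in counts.items() if n >= min_sectors)


# Hypothetical input: each sector maps to the requirements it rated as significant.
EXAMPLE = {
    "health": {"data security and privacy", "data quality"},
    "finance": {"data security and privacy", "real-time insights"},
    "energy": {"real-time insights", "data enrichment"},
}
```

Here `cross_sector_requirements(EXAMPLE)` keeps only the two requirements that appear in more than one sector. Raising `min_sectors` gives a stricter roadmap.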


Open Access This chapter is distributed under the terms of the Creative Commons Attribution- 
Noncommercial 2.5 License (http://creativecommons.org/licenses/by-nc/2.5/) which permits any 
noncommercial use, distribution, and reproduction in any medium, provided the original author(s) 


Chapter 16

New Horizons for a Data-Driven Economy: 
Roadmaps and Action Plans for Technology, 
Businesses, Policy, and Society 


Tilman Becker, Edward Curry, Anja Jentzsch, and Walter Palmetshofer 


16.1 Introduction 


A key objective of the BIG project was to define a big data roadmap that takes into 
consideration technical, business, policy, and society aspects. This chapter 
describes the integrated cross-sectorial roadmap and action plan. 

The second objective of the BIG project was to set up an industry-led initiative
around intelligent information management and big data to contribute to EU
competitiveness and position it in Horizon 2020. This objective was reached in
collaboration with the NESSI European Technology Platform with the launch of the
Big Data Value Association (BDVA).

Finally, the implementation of the roadmaps required a mechanism to transform
the roadmaps into real agendas supported by the necessary resources (economic
investment of both public and private stakeholders). This was secured with the
signature of the Big Data Value contractual Public Private Partnership (BDV cPPP)
between the BDVA and the European Commission. The cPPP was signed by Vice President
Neelie Kroes, the then EU Commissioner for the Digital Agenda, and Jan Sundelin, the
President of the Big Data Value Association (BDVA), on 13 October 2014 in Brussels.
The BDV cPPP provides a framework that guarantees the industrial leadership,
investment, and commitment of both the private and public side to build a
data-driven economy across Europe. The strategic objective of the BDV cPPP is to
master the generation of value from big data and create a significant competitive
advantage for European industry that will boost economic growth and jobs. The
BDVA has produced a Strategic Research & Innovation Agenda (SRIA) on Big
Data Value that was initially fed by the BIG technical papers and roadmaps and was
extended with the inputs of a public consultation that included hundreds of
additional stakeholders representing both the supply and the demand side.

This chapter describes the technology, business, policy, and society roadmaps 
defined by the BIG project. It then introduces the Big Data Value Association and 
the Big Data Value contractual Public Private Partnership and describes the role 
played by the BIG project in their establishment. The BDVA and the BDV cPPP 
will provide the necessary framework for industrial leadership, investment, and 
commitment of both the private and the public side to build a data-driven economy 
across Europe. 


16.2 Enabling a Big Data Ecosystem 


Big data is becoming a ubiquitous practice in both the public and private worlds. It
is not a standalone solution and depends on many layers like infrastructure, Internet
of Things, broadband, networks, and open source, among many others. Non-technical
issues, including policy, skills, regulation, and business models, are equally
critical.

Big data has to be embedded in the European business agenda. Policymakers
therefore need to act in a timely manner to promote an environment that is
supportive of organizations seeking to benefit from this inevitable progression
and the opportunities it presents. Failure to develop a comprehensive big data
ecosystem in the next few years carries the risk of losing further competitive
advantage in comparison to other global regions.

The roadmaps described in this chapter outline the most urgent and challenging 
issues for big data in Europe. They are based on over 2 years of research and input 
from a wide range of stakeholders with regard to policy, business, society, and 
technology. The roadmaps will foster the creation of a big data ecosystem. They 
will enable enterprises, business (both large and small), entrepreneurs, start-ups, 
and society to gain from the benefits of big data in Europe. This chapter presents a 
summary of the roadmaps; a full description is available in Becker et al. (2014). 


16.3 Technology Roadmap for Big Data


To determine which technologies are needed at what point in time, a
systematic approach for predicting technology developments is needed. The
sector-specific technology roadmaps establish such a framework by
aligning user needs and associated requirements with technological advances and 
the related research questions. In contrast to a technology roadmap developed in the 
context of a single company, the approach taken here covers the development of a 
technology roadmap for the European market. As a consequence, it was not possible 
to come up with a precise timeline of technology milestones, as the speed of 
technology development and its adoption relies (a) on the degree to which the 
identified non-technical requirements will be addressed and (b) on the extent to 
which European organizations are willing to invest in and leverage big data. 

Figure 16.1 depicts a consolidated technology roadmap for big data. For sector- 
specific technology roadmaps, refer to Part II of this book and Zillner et al. (2014). 
For a more detailed description of the consolidated technology roadmap, see Becker 
et al. (2014). 


16.4 Business Roadmap for Big Data 


The role of business is critical to the adoption of big data in Europe. Businesses 
need to understand the potential of big data technologies and have the capability to 
implement appropriate strategies and technologies for commercial benefit. The big 
data business roadmap is presented in Table 16.1. 


Attitude of Change and Entrepreneurial Spirit The majority of European com- 
panies and their leaderships need to tackle the core issue of using data to drive their 
organization. This requires that data-driven innovation becomes a priority at the top 
level of the organization, not just in the IT department. An entrepreneurial spirit is 
needed in the leadership team to deal with fast changes and uncertainties in the big 
data business world. Change, even with the possible consequence of failure, should 
be embraced. 


Business Models In the coming years, the business environment will undergo 
major changes due to transformation by big data. Existing business models may 
change and new models will emerge. Businesses are still unclear what data analyses 
are of relevance and value for their business, and the return on investment is often 
unclear. However, they recognize the need to analyse the data they amass for 
competitive advantage and to create new business opportunities. The adaption to 
these changes will be crucial to the success of many organizations. 


Privacy by Design Privacy by design can gain more trust from customers and
users. Europe needs to take a leading role in incorporating privacy by design into
the business operations of all its sectors.


Fig. 16.1 Technology roadmap for big data (timeline from 2015 to 2020; panels: Data
Management Engineering, Deep Data Analytics, Data Quality, Data Visualization & User
Experience, Data Security & Privacy)


Education of Workforce There is a war for big data talent. Businesses should
focus on training and educating all their staff, not just those in IT departments,
in the necessary big data related skills.


Standardization Businesses need to work with other stakeholders and
organizations to create the necessary technology and data standards to enable a big
data ecosystem. The lack of standards, seen for example in the non-interoperability
of NoSQL databases and SQL databases, is a major barrier to faster adoption of
big data.


Table 16.1 Business roadmap for big data

Business | 2015 | 2019 or earlier
1. Attitude of change and entrepreneurial spirit | The change at top-level management starts and entrepreneurial activity is encouraged. | Top-level management in European businesses have a big data-driven mind-set.
2. Business models | Exploring business models driven by big data. | Successfully exploiting new big data business models.
3. Privacy by design | Start implementing privacy by design. | Privacy by design by default.
4. Education of workforce | New workforce educational programs on big data. | Significant increase of big data savvy employees in all departments.
5. Standardization | Identify critical standardization needed. | Major steps in standardization are achieved.
6. Increasing research and development | Increasing big data R&D spend. | Minimum of 25 % increase in big data R&D spend.


Increasing Research and Development Businesses have to avoid losing their
edge and invest in big data R&D to gain a competitive advantage for their
organizations. Appropriate support should be put in place within both the public
and private sectors to foster the research and innovation needed for big
data value.


16.5 Policy Roadmap for Big Data 


European policies and agendas are critical to ensuring that big data can reach its full 
potential in Europe. The policy roadmap for big data is available in Table 16.2. 


Education and Skills Recognition and promotion of digital literacy as an
important twenty-first century skill is one of the most crucial areas for the long-term
success of big data in Europe. There is already a huge shortage of IT and big data
professionals, and Europe is predicted to face a shortage of up to 900,000 ICT
professionals by 2020.¹ The skills shortage is risking the potential for growth and
digital competitiveness. According to a number of studies, the demand for specific
big data workers (e.g., data scientists, data engineers, architects, analysts) will
further increase by up to 240 % in the next 5 years,² which could result in an
additional 100,000 data-related jobs by 2020. This problem affects not only the big
data domain but also the whole digital landscape and has to be addressed in a
general, broad, and urgent manner. Data and code literacy should be integrated into
the standard curriculum from an early age. Specific big data skills like data
engineering, data science, statistical techniques, and related disciplines should be
taught in institutions of higher education. Easier access to work permits for
non-Europeans should also be considered to help spur the European big data economy.


¹ http://europa.eu/rapid/press-release_IP-14-1129_en.htm
² http://ec.europa.eu/information_society/newsroom/cf/dae/document.cfm?doc_id=6243


Table 16.2 Policy roadmap for big data

Policies | 2015 | 2019 or earlier
1. Education and skills | Big data education shortcomings are tackled. | Best continent for big data education.
2. Digital single market | Focus on creating a single European data market. | Single European data market for 500 million users established.
3. Funding for big data technology | Maintain current funding levels (850 Mio.). | Double the size of venture capital scene in Europe as of 2015.
4. Open data and data silos | Discussion on open government data by default. | Europe leading in open data. Minimized data silos.
5. Privacy and legal | Starting public debate, EU Data Protection signed. | Appropriate balance for people and businesses reached.
6. Foster technical infrastructure | Continue fostering the IT environment. | European infrastructure competitive with or surpasses US/Asia.


European Digital Single Market Despite the fact that the digital economy has 
existed for some time now, the EU’s single market is still functioning best in more 
traditional areas like the trade of physical goods. It has so far failed to adapt to many 
of the challenges of the digital economy. 

An established digital single market could lead the world in digital technology. 
Policymakers need to promote harmonization. This means combining 28 different 
regulatory systems, removing obstacles, tackling fragmentation, and improving 
technical standards and interoperability. Reaching this goal by 2019 is quite 
ambitious, but it is a necessary step towards a future European common data area. 


Funding for Big Data Technology Create a friendlier start-up environment with
increased access to funding. There is a lack of appropriate funding for research and
innovation. Public support and funding should increase. However, given the
current budget constraints in Europe, alternative approaches also need to be
considered (such as providing legal incentives for investment in big data, European
Investment Bank, etc.).

Europe is also lacking an entrepreneurial atmosphere (e.g. venture capital spent
per capita in comparison to the USA or Israel). Fostering a better private financing
environment for start-ups and SMEs is crucial.


Privacy and Legal Provide clear, understandable, reasonable rules regarding data
privacy. When it comes to privacy rights and big data, Europe faces a double
challenge: the lack of a European Digital Single Market and the absence of unified
user rights. This needs to be urgently addressed, since confidence in and adoption of
big data technology depend on the trust of the user. According to the latest
indications, EU Data Protection is expected to be signed in 2015, but a broader
discussion will still be needed. Other areas that need to be considered are copyright
and whether there is a right of data ownership.

No matter how quickly technology advances, it remains within the citizens’ 
power to ensure that both innovation is encouraged and values are protected 
through law, policy, and the practices encouraged in the public and private sector. 
To that end, policymakers should set clear rules regarding data privacy so that 
organizations know what personal data they can store and for how long, and what 
data is explicitly protected by privacy regulations. Policymakers need to advance
consumer and privacy laws to ensure consumers have clear, understandable, 
reasonable standards for how their personal information is used in the big data era. 


Open Data and Data Silos Open data can create a cultural change within
organizations towards data sharing and cooperation. From reducing the costs of data
management to creating new business opportunities, many organizations are
gaining benefits from opening up and sharing selected enterprise data. European
governments need to start the discussion on openness by default, harnessing data as
a public resource to improve the delivery of public services. The sooner European
governments open their data, the higher the returns. Big Open Data should be the
goal where possible.


Foster Technical Infrastructure Big data is not a standalone solution and 
depends on many layers like infrastructure, Internet of Things, broadband access 
for users, networks, open source, and many more. The cross-fertilization of these 
layers is vital to the success of big data. A technology push is needed to strengthen 
European technology providers to provide big data infrastructure that is competitive 
or leading when compared to other regions. 


16.6 Society Roadmap for Big Data 


In addition to the business and policy roadmaps presented, a roadmap for society in
Europe has been defined. Without the support of European citizens, the uptake of
big data technologies can be delayed and the opportunities available lost. A
campaign to increase the awareness of the benefits of big data would be useful in
order to motivate European citizens and society. This campaign could include the
promotion of role models (especially women and people with diverse backgrounds)
and the positive long-term effects of the IT and innovation sectors. The
society roadmap for big data is presented in Table 16.3.


Education and Skills Knowledge of mathematics and statistics, combined with
coding and data skills, is the basis for big data literacy. Improving big data literacy
is important for the data-driven society. It is important that members of society
develop fluency in understanding the ways in which data can be collected and
shared, how algorithms are employed, and for what purposes. It is important to
ensure citizens of all ages have the ability and necessary tools to adequately protect
themselves from data use and abuse. Initiatives such as “Code Week for Europe”
are good exemplars for similar events in the big data domain.


Table 16.3 Society roadmap for big data

Society | 2015 | 2019 or earlier
1. Education and skills | Are you already coding? | Four times the coders and big data skilled people in Europe as in 2014.
2. Collaborative networks | Are you connected? | Leading continent with regard to a democratic big data community.
3. Open data | Are you already engaged in open data? | Europe is the leading open data society.
4. Entrepreneurship | Are you a data innovator? | Significant increase of big data entrepreneurship.
5. Civil engagement | Are you voting or staying in contact with your Member of the European Parliament (MEP)? | Europe is the most digital and political big data engaged society.
6. Privacy and trust | What’s your stance on privacy? Do you trust big data? | Europe leading continent in privacy. Significant increase of trust in big data.


Collaborative Networks All segments of society, from hacker spaces to start-ups,
from SMEs to bigger businesses, from angel investors to politicians in Brussels,
have to pull together to advance the big data agenda in Europe. Europe has the
chance to become the continent that embraces big data through a bottom-up
democratic process.


Open Data Open data is a good way to engage citizens and to illustrate the 
positive benefits of big data for organizational change, efficiency, and transparency 
(of course only with non-personal open government data). The goal should be big 
open data for Europe. 


Entrepreneurship Current IT and big data developments have a tremendous impact
on the business world and society as a whole. The opportunity to change
things for the better for society needs to be taken. Affordable access to tools, data,
technologies, and services is needed to foster an ecosystem of support for both
commercial and social entrepreneurs to exploit the potential of big data to create
new products and services, establish start-ups, and drive new job creation.


Civil Engagement Every person in Europe can change the way Europe deals with 
the effects of big data by influencing the politics and policies in Brussels. Citizens 
need to understand that “Europe is you” and that their participation in the political 
life of the European community during this era of digital transition is needed. Civil 
society has to play a crucial role, which relies on every single citizen being an
engaged citizen. 


Privacy and Trust An urgent point for the success of big data in Europe is the 
need for an open discussion on the pros and cons of big data and privacy to build the 
trust of citizens. The different points of view that exist in European member states 
and their citizens need to be addressed. Trust has to be established in a European 
digital single data market where both consumer and civil liberties are protected. 
Citizens have to raise their voice; otherwise their demands will not be heard in the 
on-going discussions on privacy. 


16.7 European Big Data Roadmap 


The final step was to create an integrated roadmap that takes into consideration
technical, business, policy, and society aspects. The resulting European big data
roadmap reflects a consensus on the priorities and actions
needed for big data in Europe. The roadmap (as illustrated in Fig. 16.2) is the result
of over 2 years of extensive analysis and engagement with stakeholders in the big
data ecosystem. It is important to note that while actions are visualized sequentially,
in reality many can and should be tackled in parallel, as detailed in the
specific roadmaps.


Fig. 16.2 European big data roadmap


16.8 Towards a Data-Driven Economy for Europe


In her many speeches as European Digital Commissioner, Neelie Kroes called for
action from European stakeholders to mobilize across society, industry, academia,
and research to enable a European big data economy. VP Kroes identified the need
to establish and support a framework that ensures there are enough high-skilled
data workers (analysts, programmers, engineers, scientists, journalists,
politicians, etc.) to deliver the future technologies, products, and services
needed for big data value chains and to ensure a sustainable stakeholder community
in the future.

A key aim of the BIG project was to create new and enhance existing connec- 
tions in the current European-wide big data ecosystem, by fostering the creation of 
new partnerships that cross sectors and domains. Europe needs to establish strong 
players in order to make the entire big data value ecosystem, and consequently 
Europe’s economy, strong, vibrant, and valuable. BIG recognized the need to create 
venues that enable the interconnection and interplay of big data ideas and capabili- 
ties that would support the long-term sustainability, access, and development of a 
big data community platform. The linking of stakeholders would form the basis for 
a big data-driven ecosystem as a source for new business opportunities and inno- 
vation. The cross-fertilization of stakeholders is a key element for advancing the 
sustainable big data economy. 


16.9 Big Data Value Association 


The Big Data Public Private Forum, as it was initially called, was intended to create 
the path towards implementation of the roadmaps. The path required two major 
elements: (1) a mechanism to transform the roadmaps into real agendas supported 
by the necessary resources (economic investment of both public and private stake- 
holders) and (2) a community committed to making the investment and collabo- 
rating towards the implementation of the agendas. 

The BIG consortium was convinced that achieving this outcome would require 
creating a broad awareness and commitment outside of the project. BIG took the 
necessary steps to contact major players and to liaise with the NESSI European 
Technology Platform to jointly work towards this endeavour. The collaboration was 
set up in the summer of 2013 and allowed the BIG partners to establish the 
necessary high-level connections at both industrial and political levels. The objec- 
tive was reached in collaboration with NESSI with the launch of the Big Data Value 
Association (BDVA) and the Big Data Value contractual Public Private Partnership 
(BDV cPPP) within Horizon 2020. 

The BDVA is a fully self-financed not-for-profit organization under Belgian law 
with 24 founding members from large and small industry and research, including 
many partners of the BIG project. The BDVA is an industrially led representative 
community of stakeholders ready to commit to a big data value cPPP with a
willingness to invest money and time. 

The objective of the BDVA is to boost European big data value research, 
development, and innovation. It aims to: 


• Strengthen competitiveness and ensure industrial leadership of providers and
end users of big data value technology-based systems and services

• Promote the widest and most effective uptake of big data value technologies and
services for professional and private use

• Establish scientific excellence as the base for the creation of value from big data


The BDVA will carry out a number of activities to achieve its objectives. These
include:


• Developing strategic goals for European big data value research and innovation,
and supporting their implementation

• Improving the industrial competitiveness of Europe through innovative big data
value technologies, applications, services, and solutions

• Strengthening networking activities of the European big data value community

• Promoting European big data value offerings and organizations

• Reaching out to new and existing users

• Contributing to policy development, education, and the ramifications of
technology in ethical, legal, and societal areas


16.10 Big Data Value Public Private Partnership 


The BDVA developed a Strategic Research & Innovation Agenda (SRIA) on 
Big Data Value (BDVA 2015) that was initially fed by the BIG technical papers 
and roadmaps and extended with the inputs of a public consultation that included 
hundreds of additional stakeholders representing both the supply and the demand 
side. The BDVA then developed a cPPP (contractual PPP) proposal as the formal 
step to set up a PPP on big data value. The cPPP proposal builds on the SRIA by 
adding content elements such as potential instruments that could be used
for the implementation of the agenda. 

A vital role in the European big data landscape will be fulfilled by the Big Data 
Value contractual Public Private Partnership (BDV cPPP). The BDV cPPP was signed
in Brussels on 13 October 2014 by the then European Commission Vice-President
Neelie Kroes and the President of the BDVA, Jan Sundelin (TIE Kinetix). The BDVA
is the industry-led contractual counterpart to the European
Commission for the implementation of the BDV cPPP. The main role of the BDVA 
will be to regularly update the Big Data Value SRIA, define and monitor the metrics 
of the BDV cPPP, and participate with the European Commission in the BDV cPPP 
partnership board. 

The signature of the BDV cPPP is the first step towards building a thriving data
community in the EU. The BDV cPPP is driven by the conviction that research and 
innovation focusing on a combination of business and usage needs is the best long- 
term strategy to deliver value from big data and create jobs and prosperity. The 
strategic objectives of the BDV cPPP as stated in the BDV SRIA (BDVA 2015) are: 


• Data: To access, compose, and use data in a simple, clearly defined manner that
allows the transformation of data into information.

• Skills: To contribute to the conditions for skills development in industry and
academia.

• Legal and Policy: To contribute to policy processes for finding favourable
European regulatory environments, and address the concerns of privacy and
citizen inclusion.

• Technology: To foster European BDV technology leadership for job creation
and prosperity by creating a European-wide technology and application base and
building up competence. In addition, enable research and innovation, including
the support of interoperability and standardization, for the future basis of BDV
creation in Europe.

• Application: To reinforce the European industrial leadership and capability to
successfully compete on a global level in the data value solution market by
advancing applications transformed into new opportunities for business.

• Business: To facilitate the acceleration of business ecosystems and appropriate
business models with particular focus on SMEs, enforced by Europe-wide
benchmarking of usage, efficiency, and benefits.

• Social: To provide successful solutions for the major societal challenges that
Europe is facing such as health, energy, transport, and the environment. And to
increase awareness about BDV benefits for businesses and the public sector,
while engaging citizens as prosumers to accelerate acceptance and take-up.


Given the broad range of objectives focusing on the different aspects of big data
value, a comprehensive implementation strategy is needed. The BDVA
SRIA (BDVA 2015) details an interdisciplinary implementation approach that
integrates expertise from the different fields necessary to tackle both the strategic
and specific objectives of the BDV cPPP. The strategy contains several
types of mechanisms: cross-organizational and cross-sectorial
environments known as i-Spaces, as illustrated in Fig. 16.3, which will allow
challenges to be tackled in an interdisciplinary manner while also serving as hubs
for research and innovation activities; lighthouse projects, which will raise
awareness of the opportunities offered by big data and the value of data-driven
applications for different sectors; technical projects, which will address targeted
aspects of the technical priorities; and projects to foster and support efficient
cooperation and coordination across all BDV cPPP activities.

[Figure labels: European Innovation Spaces for providing secure places for data, building skills, identifying best practices, and maturing tools; Application]


Fig. 16.3 Interconnected challenges of the BDV cPPP within i-Spaces [from BDVA (2015)] 


16.11 Conclusions 


A key objective of the BIG project was to define a European big data roadmap that 
takes into consideration technical, business, policy, and society aspects. This 
chapter details the resulting cross-sectorial roadmap and associated action plans. 
The second objective of the BIG project was to set up an industry-led initiative
around intelligent information management and big data to contribute to EU 
competitiveness and position it in Horizon 2020. The Big Data Public Private 
Forum, as it was initially called, was intended to create the path towards implement- 
ation of the roadmaps. The path required two major elements: (1) a mechanism to 
transform the roadmaps into real agendas supported by the necessary resources 
(economic investment of both public and private stakeholders) and (2) a community 
committed to making the investment and collaborating towards the implementation 
of the agendas. This objective was reached in collaboration with the NESSI 
technology platform with the launch of the Big Data Value Association (BDVA) 
and the Big Data Value contractual Public Private Partnership (BDV cPPP) within 
Horizon 2020. The BDVA and the BDV cPPP provide the necessary framework
that guarantees the industrial leadership, investment, and commitment of both the 
private and the public side to build a data-driven economy across Europe. The 
strategic objective of the BDV cPPP is to master the generation of value from big 
data and create a significant competitive advantage for European industry that will 
boost economic growth and jobs. 